CN113627327A - Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network

Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network

Info

Publication number
CN113627327A
Authority
CN
China
Prior art keywords
time, data, frequency, singing voice, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110912362.8A
Other languages
Chinese (zh)
Inventor
桂文明 (Gui Wenming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology filed Critical Jinling Institute of Technology
Priority to CN202110912362.8A priority Critical patent/CN113627327A/en
Publication of CN113627327A publication Critical patent/CN113627327A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Abstract

The invention discloses a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network. In a typical convolutional-neural-network singing voice detection algorithm, the network input layer is a single two-dimensional time-frequency matrix. Exploiting the multi-scale character of music signals, the method generates several two-dimensional time-frequency matrices at different scales by adjusting the window length of the short-time Fourier transform, and then feeds these time-frequency graphs to the convolutional neural network as parallel channels, so that the receptive fields of the network's neurons observe information of the music signal at several scales simultaneously. This strengthens the neurons' ability to extract and resolve time-frequency features and improves the overall performance of singing voice detection.

Description

Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network
Technical Field
The invention relates to the technical field of music artificial intelligence, and in particular to a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network.
Background
Regarding the background art of singing voice detection, the applicant has previously described a singing voice detection method based on a squeeze-and-excitation residual network (application No. CN202010164594.5) and a singing voice detection method based on a dot-product self-attention convolutional neural network (patent No. ZL202110192300.4). Singing voice detection (SVD) is the process of deciding whether each short segment of audio in digital music contains singing voice; the detection granularity is typically between 50 and 200 milliseconds. Singing voice detection is important fundamental work in the field of music information retrieval (MIR): many other research directions, such as singer identification, singing voice separation and lyric alignment, require it as a prerequisite or enhancement technology. Besides singing voice, music generally also contains instrument sounds, and although it is easy for a person to judge whether a piece mixing instruments and singing contains voice, this remains a challenging task for a machine.
The singing voice detection process generally comprises preprocessing, feature extraction, classification and post-processing, of which feature extraction and classification are the two most important steps. In feature extraction, the simplest and most common feature is the time-frequency graph obtained by the short-time Fourier transform; its variants include the mel time-frequency graph and the logarithmic mel time-frequency graph. Other features are typically derived from the time-frequency graph, such as mel-frequency cepstral coefficients (MFCC), fluctogram features, spectral flatness and spectral contrast. For classification, the main methods divide into those based on traditional classifiers and those based on deep neural networks (DNN): the former include the support vector machine (SVM), the hidden Markov model (HMM) and the random forest (RF), while the latter include methods using convolutional neural networks (CNN) and recurrent neural networks (RNN).
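As a concrete illustration of these standard features (not part of the claimed invention; the file name and parameter values are illustrative assumptions), the following Python sketch computes several of them with the librosa library:

    import librosa
    import numpy as np

    # Load a music file (path and sample rate are illustrative assumptions).
    x, sr = librosa.load("song.mp3", sr=22050, mono=True)

    # Time-frequency graph: magnitude of the short-time Fourier transform.
    S = np.abs(librosa.stft(x, n_fft=1024))

    mel = librosa.feature.melspectrogram(S=S**2, sr=sr)        # mel time-frequency graph
    log_mel = librosa.power_to_db(mel)                         # logarithmic mel graph
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)         # mel-frequency cepstral coefficients
    flatness = librosa.feature.spectral_flatness(S=S)          # spectral flatness
    contrast = librosa.feature.spectral_contrast(S=S, sr=sr)   # spectral contrast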
Aiming at the singing voice detection problem, the applicant previously filed a singing voice detection method based on a squeeze-and-excitation residual network (application No. CN202010164594.5). The method comprises the following steps: construct squeeze-and-excitation residual networks; construct a music data set; convert the music data set into an image set; train the constructed networks with the training image set; test the trained networks with the test image set; select the network with the highest test accuracy as the final singing voice detection network; and use the selected network to perform singing voice detection on the audio file under test. Singing voice features of different levels are implicitly extracted by the deep residual network, and the adaptive attention characteristic of the embedded squeeze-and-excitation modules judges the importance of the features. On the JMD data set, with network depths of 14, 18, 34, 50, 101, 512 and 200 respectively, the average detection accuracy is 88.19, so the effect still leaves room for improvement. In addition, the network stacking approach consumes considerable computing resources and the training time is long.
To address the singing voice detection problem, the applicant also filed a singing voice detection method based on a dot-product self-attention convolutional neural network (patent No. ZL202110192300.4). In that method a dot-product self-attention module is embedded in the convolutional neural network: after each of two convolution-group modules, a dot-product self-attention module re-estimates attention weights over the features output by that module, and the re-weighted feature map is passed to the next layer of the network. The attention paid to the features learned by the convolutional network is thus no longer uniform; this re-estimation mechanism lets the network treat features differently and improves overall performance. In addition, the dot-product self-attention module improves on the traditional dot-product self-attention model used in machine translation: first, the key-value pair <k, v> and the query vector q may have unequal lengths; second, the meanings of q, k and v are redefined; third, an attention-distribution transformation mechanism is added.
The invention considers improving detection performance by improving the network input layer of a CNN-based singing voice detection algorithm. In a typical CNN-based singing voice detection algorithm, the network input layer is a single time-frequency matrix, obtained by windowing the music signal with a window of a certain length and applying the Fourier transform, i.e. a time-frequency graph at one scale. Although a time-frequency graph at one scale extracts typical features of the original signal and may suffice for some analyses, it retains information at only that one scale, and some problems need information at more scales, since multi-scale information is more conducive to analysis. The essence of the short-time Fourier transform is matching the signal intercepted by a window function against cosine bases; when the window length matches the signal well, the signal is represented more accurately. Therefore, when a single-scale time-frequency graph cannot meet the analysis requirement, a multi-scale time-frequency graph is proposed, which is more favorable to analyzing the signal. Fig. 1 shows the time-frequency graphs of one song at two different scales: the graph at scale 2048 (lower) is clearer than the graph at scale 512 (upper), which shows that at scale 2048 the time-frequency graph expresses this song's information more accurately; integrating the information of both scales is obviously even more beneficial to signal analysis. According to this principle, the invention first generates several two-dimensional time-frequency matrices at different scales by adjusting the window length of the short-time Fourier transform, and then feeds these time-frequency graphs into the convolutional neural network as parallel channels, so that the neuron receptive fields of the network observe information of the music signal at several scales simultaneously, thereby strengthening the neurons' ability to extract and resolve time-frequency features and improving the overall performance of singing voice detection.
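The window-length trade-off behind this multi-scale idea can be seen in a minimal sketch (the synthetic signal and the window lengths 512 and 2048 are illustrative assumptions): the shorter window yields more frames but fewer frequency bins, i.e. finer time resolution and coarser frequency resolution, and vice versa for the longer window.

    import numpy as np
    import librosa

    sr = 22050
    t = np.arange(2 * sr) / sr
    x = np.sin(2 * np.pi * 440 * t)      # a synthetic test tone

    for w in (512, 2048):                # two window lengths = two scales
        S = np.abs(librosa.stft(x, n_fft=w, hop_length=w // 4))
        bins, frames = S.shape
        print(f"window {w}: {bins} frequency bins x {frames} frames")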
Disclosure of Invention
The purpose of the invention: aiming at the defects of the prior art, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network. On one hand, through the parallel multi-channel multi-scale time-frequency-graph input, the neural network observes information of the music signal at several scales simultaneously, which adjusts its resolving power over the high-frequency and low-frequency parts, extracts the characteristics of the singing voice accurately, and improves the overall performance of singing voice detection; on the other hand, the multi-channel data correspond to one music signal with a single classification, which amounts to lateral data augmentation and also helps improve detection accuracy.
The technical scheme is as follows: to achieve the above purpose, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, comprising the following specific steps:
step 1: for single music filePerforming short-time Fourier transform by different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data (a preprocessing sketch is given after step 8 below);
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$.
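The preprocessing sketch referenced in step 2 above illustrates sub-steps 2) to 7) for one time-frequency-graph file. All names and parameter values (h, w, hop) are illustrative assumptions, and the matrices M_max, M_min and the per-channel statistics u, σ (arrays of 3 equal values) are assumed to have already been accumulated over the whole data set as described:

    import numpy as np

    def slice_tfg(F, w=115, hop=50):
        # 2) Slice a time-frequency graph F (h rows x time columns) into h x w images.
        h, total = F.shape
        slices = []
        start = 0
        while start < total:
            patch = F[:, start:start + w]
            if patch.shape[1] < w:                       # zero-pad the last slice
                patch = np.pad(patch, ((0, 0), (0, w - patch.shape[1])))
            slices.append(patch)                         # single-channel small image
            if start + w >= total:
                break
            start += hop                                 # hop < w, so slices overlap
        return slices

    def preprocess(patch, M_max, M_min, u, sigma):
        # 4) element-wise max-min normalization with the data-set matrices
        norm = (patch - M_min) / (M_max - M_min + 1e-12)
        # 5) three-channel grayscale image with values between 0 and 255
        gray = np.repeat((norm * 255.0)[None, :, :], 3, axis=0)
        # 7) per-channel standardization with the data-set statistics u, sigma
        return (gray - u[:, None, None]) / sigma[:, None, None]

    # A multi-scale combination stacks the n processed slices channel-wise,
    # giving a (3*n, h, w) input for the network of step 3:
    # combo = np.concatenate([preprocess(s, M_max, M_min, u, sigma)
    #                         for s in per_scale_slices], axis=0)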
Step 3: construct a singing voice detection network based on a convolutional neural network with n-scale small-image input, the number of input channels being 3×n;
the structure diagram of the convolutional neural network comprises four components:
the first part is an input layer, where the input layer has 3 × n input channels;
the second part and the third part have the same structure and are channel attention convolutional layers which respectively consist of 2 BN convolutional blocks, 1 maximum value pooling layer and 1 SEBlock channel attention layer;
the structure of the BN convolution block and the SEColck is characterized in that the BN convolution block consists of 1 3 multiplied by 3 convolution, 1 BatchNorm layer and a Relu unit; SEBlock is a squeezing and exciting module, assuming that the convolution output F of the previous layer is a picture with the height and width of h multiplied by w, the number of channels is c, squeezing operation is a global tie pooling layer, and c channels are compressed into c descriptors; the first step of the excitation operation is a door mechanism, and specifically comprises that a first full-connection layer reduces dimensions of c descriptors by r times, then a Relu function is used for carrying out nonlinear transformation, and then a second full-connection layer multiplies the dimensions by r; the second step of excitation operation is that firstly, a Sigmod activation function is used for carrying out weight estimation on channels, then, the channels are adjusted according to the weight estimation through Scale operation, finally, the adjusted channels F' enter a next layer of network, SEBlock enables the action of the channels on the next layer of network to be changed, the weights are not equal any more, but are obtained through learning, and the process is essentially a learning and distributing process of channel attention; the fourth part is a feature vector extraction layer which comprises 3 full-connection layers and 2 Dropout layers, wherein the full-connection layers store high-level information extracted by the previous convolutional layer, the dimension is further reduced in a feature vector mode, finally output one-dimensional data determines whether singing voice segments corresponding to the input n-scale time-frequency graphs contain singing voice or not, the output one-dimensional data is converted into probability values by a Sigmod function, and then the loss of training is calculated by a weighted binary cross entropy loss function;
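A minimal PyTorch sketch of the four-part network described above follows. It is an illustrative reading of the description, not the exact network of the invention: the channel widths, the reduction factor r and n = 3 scales (9 input channels) are assumptions.

    import torch
    import torch.nn as nn

    class BNConvBlock(nn.Module):
        # One 3x3 convolution + BatchNorm + ReLU.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        def forward(self, x):
            return self.block(x)

    class SEBlock(nn.Module):
        # Squeeze-and-excitation: global average pool, gate, channel re-scale.
        def __init__(self, c, r=16):
            super().__init__()
            self.squeeze = nn.AdaptiveAvgPool2d(1)             # c channels -> c descriptors
            self.excite = nn.Sequential(
                nn.Linear(c, c // r), nn.ReLU(inplace=True),   # reduce dimension by r
                nn.Linear(c // r, c), nn.Sigmoid(),            # restore; weights in (0,1)
            )
        def forward(self, x):
            b, c, _, _ = x.shape
            w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                       # Scale operation

    class SVDNet(nn.Module):
        def __init__(self, n_scales=3):
            super().__init__()
            c_in = 3 * n_scales                                # part 1: input layer
            def part(cin, cout):                               # parts 2 and 3
                return nn.Sequential(BNConvBlock(cin, cout), BNConvBlock(cout, cout),
                                     nn.MaxPool2d(2), SEBlock(cout))
            self.features = nn.Sequential(part(c_in, 32), part(32, 64))
            self.head = nn.Sequential(                         # part 4: 3 FC + 2 Dropout
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(64, 1),                              # one-dimensional output
            )
        def forward(self, x):
            return self.head(self.features(x))                 # logits; apply Sigmoid

    # Usage sketch: model = SVDNet(n_scales=3); logits = model(torch.randn(8, 9, 80, 115))
    # Training would pair the Sigmoid with a weighted binary cross-entropy, e.g.
    # loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.5]))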
and 4, step 4: training and testing, and counting the evaluation result;
1) From the training-set small-image combinations $\hat{G}_j$ obtained in step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From the validation-set small-image combinations $\hat{G}_k$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1);
3) From the test-set small-image combinations $\hat{G}_l$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn;
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test;
If the prediction is singing voice, the result is called positive (P); if not, negative (N). By comparison with the singing voice labels in the data set, a correct prediction is marked T and a wrong one F, so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A, precision P, recall R and F-value are computed separately, where the F-value integrates precision P and recall R (a small evaluation sketch follows these formulas):

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
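A small sketch of this per-song evaluation, assuming binary predictions and labels are available as numpy arrays (all names are illustrative):

    import numpy as np

    def song_metrics(pred, label):
        # pred, label: 0/1 arrays over one song's small-image combinations
        o_tp = int(np.sum((pred == 1) & (label == 1)))   # correct positives
        o_tn = int(np.sum((pred == 0) & (label == 0)))   # correct negatives
        o_fp = int(np.sum((pred == 1) & (label == 0)))   # false alarms
        o_fn = int(np.sum((pred == 0) & (label == 1)))   # misses
        a = (o_tp + o_tn) / (o_tp + o_tn + o_fp + o_fn)
        p = o_tp / (o_tp + o_fp) if o_tp + o_fp else 0.0
        r = o_tp / (o_tp + o_fn) if o_tp + o_fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return a, p, r, f

    # The test result averages each index over all songs:
    # A, P, R, F = np.mean([song_metrics(p, y) for p, y in songs], axis=0)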
as a further improvement of the invention, the time-frequency diagram calculation process in step 1 comprises:
1) Set the window lengths $w_i, i \in [1..n]$ and compute the short-time Fourier transform $S_i = stft(x, w_i)$ of the music file x. When computing the short-time Fourier transform, half the window length is padded on each side of the data sequence of x, so that the frame indices of the time-frequency graphs at the different scales correspond to the same times; the singing voice annotation times of the time-frequency graphs then stay consistent, and the parallel input of the time-frequency graphs to the convolutional neural network yields a single classification result;
2) Normalize the frequencies of $S_i$ to the mel scale: $M_i = mel(S_i)$;
3) Take the logarithm of the coefficients of $M_i$ to obtain the time-frequency graph $F_i = todb(M_i)$;
4) The time-frequency graph $F_i$ is in fact a two-dimensional matrix whose rows represent mel-frequency indices and whose columns correspond to the time at which the music proceeds. The matrix data is saved as a file for further processing; for a single music file there are n corresponding time-frequency-graph files at different scales.
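A sketch of this step-1 pipeline under the assumption that librosa is used: librosa.stft with center=True pads half the window length on each side of the signal, so frame k of every scale is centered on the same sample, as required above. The window lengths, hop length and mel parameters are illustrative assumptions:

    import librosa
    import numpy as np

    def time_frequency_graphs(path, window_lengths=(512, 1024, 2048)):
        x, sr = librosa.load(path, sr=22050, mono=True)
        graphs = []
        for w in window_lengths:
            # 1) STFT; center=True pads w//2 samples on both sides of x, so
            #    frame indices align in time across all scales (same hop length).
            S = np.abs(librosa.stft(x, n_fft=w, hop_length=315, center=True))
            # 2) normalize the frequencies to the mel scale
            M = librosa.feature.melspectrogram(S=S**2, sr=sr, n_mels=80)
            # 3) take the logarithm of the coefficients (dB)
            F = librosa.power_to_db(M)
            # 4) F: rows = mel indices, columns = time frames; one file per scale
            np.save(f"{path}.w{w}.npy", F)
            graphs.append(F)
        return graphs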
As a further improvement of the invention, the maximum and minimum matrices $M_{max}$ and $M_{min}$ in step 2 are computed as follows:

$$M_{max}(p,q) = \max_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

$$M_{min}(p,q) = \min_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

where $M_{max}$ and $M_{min}$ store, in matrix form, the maximum and minimum values at every small-image pixel position (p, q) over the training, validation and test data sets.
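A sketch of accumulating these pixel-level matrices over all small images of the training, validation and test sets, assuming every small image has the same h × w shape:

    import numpy as np

    def minmax_matrices(small_images):
        # small_images: iterable of h x w arrays drawn from all three data sets
        it = iter(small_images)
        first = next(it)
        m_max, m_min = first.copy(), first.copy()
        for img in it:
            np.maximum(m_max, img, out=m_max)   # element-wise running maximum
            np.minimum(m_min, img, out=m_min)   # element-wise running minimum
        return m_max, m_min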
As a further improvement of the invention, the operation of the BN layer in step 3 is expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
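A numpy sketch of formulas (1) to (4) for one feature over a batch, where ε is a small constant for numerical stability and γ, β stand in for the trainable parameters:

    import numpy as np

    def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
        # x: 1-D array with one feature's values over a batch of m samples
        mu = x.mean()                            # (1) batch mean
        var = ((x - mu) ** 2).mean()             # (2) batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # (3) standardization
        return gamma * x_hat + beta              # (4) output z_i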
The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network has the following beneficial effects:
1) In experiments on the public data set Jamendo (JMD for short), the evaluation results of the algorithm of this application are better than those of the traditional CNN method and of the algorithms of patent No. ZL202110192300.4 and application No. CN202010164594.5. In the experiments, the training, validation and test split of JMD was kept the same across every compared algorithm; each algorithm was run 3 times, and the averages of the percentages of the two indices, accuracy A and F-value, were taken as the evaluation result. The algorithm of application No. CN202010164594.5 was run 3 times at each of the depths 14, 18, 34, 50, 101, 512 and 200 and its A and F results averaged. The results are shown in Table 1 below:
Table 1 Comparison of the results of the algorithm of this application and other algorithms
[Table 1 is provided as an image in the original publication.]
The algorithm proposed in this application is higher than the traditional CNN method by 3.91 and 3.25 percentage points on the indices A and F respectively, higher than patent CN202010164594.5 by 2.26 and 1.72 percentage points, and higher than patent ZL202110192300.4 by 2.09 and 1.89 percentage points.
2) This application provides a singing voice detection method based on a convolutional neural network with parallel multi-scale time-frequency-graph input. Through the parallel multi-channel multi-scale input, on one hand the neural network observes information of the music signal at several scales simultaneously, so the network can adjust its resolving power over the high-frequency and low-frequency parts, extract the characteristics of the singing voice accurately, and improve the overall performance of singing voice detection; on the other hand, the multi-channel data correspond to one music signal with a single classification, which amounts to lateral data augmentation and also helps improve detection accuracy. This multi-channel input mode is completely different from the applicant's two prior patents ZL202110192300.4 and CN202010164594.5 and improves the overall performance.
3) The time-frequency graph of each scale is converted into a three-channel grayscale image: the converted small image changes from one original channel into standard three-channel image data with values between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive image representation simulating the naked eye, and the two added channels increase the dimensionality of the features, which makes feature extraction by the neural network easier.
4) Before the small-image data of the training, validation and test data sets are converted into three-channel grayscale images, global normalization is performed. The global normalization adopts the max-min method, and the maximum and minimum values used are pixel-level values rather than a single aggregate maximum and minimum over all pixels. This normalization normalizes the data while preserving its pixel-level character, which helps improve the overall effect of singing voice detection. Experiments on the public data set JMD under identical conditions compared this global normalization with no normalization and with normalization using the aggregate max-min method: the indices A and F of this method are higher than those of the aggregate max-min method by 0.92 and 0.81 percentage points respectively, and higher than those obtained without normalization by 2.08 and 2.04 percentage points respectively. The experimental data are tabulated below:
Table 2 Comparison of the evaluation results of the normalization methods
[Table 2 is provided as an image in the original publication.]
5) In the neural network design, BN (BatchNorm) layers are placed in the second, third and fourth parts. On one hand this addresses the internal covariate shift problem, so a larger learning rate can be used during training to accelerate convergence; on the other hand BN alleviates vanishing and exploding gradients; moreover, BN increases generalization ability to some extent. The added BN layers improve the performance of the neural network and the accuracy of singing voice detection.
6) The BN layers cooperate with the SEBlock channel-attention re-estimation mechanism to make the overall effect of the neural network better. Compared with the neural network of patent No. ZL202110192300.4, the main differences are the multi-scale multi-channel parallel input, the BN layers and SEBlock. Compared with the patent of application No. CN202010164594.5, besides the two main differences of the multi-scale multi-channel parallel input and the BN layers, this application also adopts SEBlock but does not stack SEBlock networks. Stacking SEBlocks can improve detection accuracy to a certain extent, but it makes training long and inefficient: the 200-layer network of that patent takes 103 minutes to train one round under JMD, while one round of this application needs only 11 minutes, i.e. its training time is 9.4 times that of this application. Even with the shorter training time, the accuracy and F-value of this method are still higher by 2.26 and 1.72 percentage points respectively.
Drawings
FIG. 1 shows the time-frequency graphs of one song at two scales;
FIG. 2 shows the BN convolution block structure and the SEBlock structure;
FIG. 3 is a schematic diagram of singing voice detection.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention. It should be noted that, as used in the following description, the terms "front", "rear", "left", "right", "upper" and "lower" refer to directions in the drawings, and the terms "inner" and "outer" refer to directions toward and away from, respectively, the geometric center of a particular component.
As a specific embodiment, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, comprising the following specific steps:
step 1: short-time Fourier transform of a single music file through different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
the time-frequency diagram calculation process comprises the following steps:
1) Set the window lengths $w_i, i \in [1..n]$ and compute the short-time Fourier transform $S_i = stft(x, w_i)$ of the music file x. When computing the short-time Fourier transform, half the window length is padded on each side of the data sequence of x, so that the frame indices of the time-frequency graphs at the different scales correspond to the same times; the singing voice annotation times of the time-frequency graphs then stay consistent, and the parallel input of the time-frequency graphs to the convolutional neural network yields a single classification result;
2) Normalize the frequencies of $S_i$ to the mel scale: $M_i = mel(S_i)$;
3) Take the logarithm of the coefficients of $M_i$ to obtain the time-frequency graph $F_i = todb(M_i)$;
4) The time-frequency graph $F_i$ is in fact a two-dimensional matrix whose rows represent mel-frequency indices and whose columns correspond to the time at which the music proceeds. The matrix data is saved as a file for further processing. For a single music file there are n corresponding time-frequency-graph files at different scales.
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data;
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$.
Step 3: construct a singing voice detection network based on a convolutional neural network with input channels for the n-scale time-frequency graphs;
the structure diagram of the convolutional neural network constructed by the invention is shown in fig. 2, and comprises four components: the first part is an input layer, where the input layer has 3 × n input channels; the second part and the third part have the same structure and are channel attention convolutional layers which respectively consist of 2 BN convolutional blocks, 1 maximum value pooling layer and 1 SEBlock channel attention layer; the structure of the BN convolution block and the SEColck is characterized in that the BN convolution block consists of 1 3 multiplied by 3 convolution, 1 BatchNorm layer and a Relu unit; the SEBLICK is a squeezing and exciting module, assuming that the convolution output F of the previous layer is a picture with the height and width of h multiplied by w, the number of channels is c, the squeezing operation is a global tie pooling layer, and c channels are compressed into c descriptors; the first step of the excitation operation is a door mechanism, and specifically comprises that a first full-connection layer reduces dimensions of c descriptors by r times, then a Relu function is used for carrying out nonlinear transformation, and then a second full-connection layer multiplies the dimensions by r; the second step of excitation operation is that firstly, a Sigmod activation function is used for carrying out weight estimation on channels, then, the channels are adjusted according to the weight estimation through Scale operation, finally, the adjusted channels F' enter a next layer of network, SEBlock enables the action of the channels on the next layer of network to be changed, the weights are not equal any more, but are obtained through learning, and the process is essentially a learning and distributing process of channel attention; the fourth part is a feature vector extraction layer which comprises 3 full-connection layers and 2 Dropout layers, wherein the full-connection layers store high-level information extracted by the previous convolutional layer, the dimension is further reduced in a feature vector mode, finally output one-dimensional data determines whether singing voice segments corresponding to the input n-scale time-frequency graphs contain singing voice or not, the output one-dimensional data is converted into probability values by a Sigmod function, and then the loss of training is calculated by a weighted binary cross entropy loss function;
in the neural network design, BN (BatchNorm) layers are designed in the second part, the third part and the fourth part, so that on one hand, the problem of Internal Covariate Shift is solved, and a larger learning rate can be adopted in the training process to accelerate convergence; BN, on the other hand, alleviates the problems of gradient disappearance and gradient explosion; moreover, BN increases the generalization ability to some extent. Due to the addition of the BN layer, the performance of the neural network is improved, and the accuracy of singing voice detection is improved. The core idea of the BN layer is that data output by each layer of the neural network is normalized into standard normal distribution to solve the problem of the Internal Covariate Shift, and at the moment, because the normalization can weaken the expression capability of the network to a certain extent, two trainable parameters are added to enhance the expression capability of the network.
The operation of the BN layer can be expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
Step 4: train and test, and compile the evaluation results.
1) From all multi-channel small-image combinations $\hat{G}_j$ of the training data set of step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From all multi-channel small-image combinations $\hat{G}_k$ of the validation data set of step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1).
3) From all multi-channel small-image combinations $\hat{G}_l$ of the test data set of step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn.
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test. If the prediction is singing voice, we call the result positive P (positive); if not, negative N (negative). Comparing the prediction with the singing voice labels in the data set, a correct prediction is marked T (true) and a wrong one F (false), so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A (accuracy), precision P (precision), recall R (recall) and F-value (F-measure) are computed separately, where the F-value integrates precision and recall:

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
the above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, characterized by comprising the following specific steps:
step 1: short-time Fourier transform of a single music file through different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data;
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$;
Step 3: construct a singing voice detection network based on a convolutional neural network with n-scale small-image input, the number of input channels being 3×n;
The convolutional neural network comprises four parts:
The first part is the input layer, which has 3×n input channels;
The second and third parts have the same structure: each is a channel-attention convolutional layer consisting of 2 BN convolution blocks, 1 max-pooling layer and 1 SEBlock channel-attention layer;
The structures of the BN convolution block and of SEBlock are as follows. A BN convolution block consists of one 3×3 convolution, one BatchNorm layer and one ReLU unit. SEBlock is a squeeze-and-excitation module: assuming the convolution output F of the previous layer is a picture of height and width h×w with c channels, the squeeze operation is a global average pooling layer that compresses the c channels into c descriptors. The first step of the excitation operation is a gate mechanism: a first fully connected layer reduces the dimension of the c descriptors by a factor of r, a ReLU function then applies a nonlinear transformation, and a second fully connected layer expands the dimension back by the factor r. The second step of the excitation operation first estimates a weight for each channel with a Sigmoid activation function and then adjusts the channels according to these weights through a Scale operation; finally the adjusted channels F' enter the next layer of the network. SEBlock changes the effect of the channels on the next layer: their weights are no longer equal but are obtained by learning, and the process is essentially one of learning and allocating channel attention. The fourth part is a feature-vector extraction layer comprising 3 fully connected layers and 2 Dropout layers. The fully connected layers preserve the high-level information extracted by the preceding convolutional layers and further reduce its dimension in the form of a feature vector; the final one-dimensional output decides whether the segment corresponding to the input n-scale time-frequency graphs contains singing voice. The one-dimensional output is converted into a probability value by a Sigmoid function, and the training loss is then computed with a weighted binary cross-entropy loss function;
and 4, step 4: training and testing, and counting the evaluation result;
1) From the training-set small-image combinations $\hat{G}_j$ obtained in step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From the validation-set small-image combinations $\hat{G}_k$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1);
3) From the test-set small-image combinations $\hat{G}_l$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn;
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test;
If the prediction is singing voice, the result is called positive (P); if not, negative (N). By comparison with the singing voice labels in the data set, a correct prediction is marked T and a wrong one F, so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A, precision P, recall R and F-value are computed separately, where the F-value integrates precision P and recall R:

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
2. the singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network as claimed in claim 1, characterized in that: the time-frequency diagram calculation process in the step 1 comprises the following steps:
1) setting the Window Length wi,i∈[1..n]Calculating a short-time Fourier transform S for the music file xi=stft(x,wi) When short-time Fourier transform is calculated, half of the window length is respectively filled on the left side and the right side of the data sequence of x, so that the time of the frame numbers corresponding to the time-frequency graphs with different scales is consistent, the singing voice labeling time corresponding to the time-frequency graphs is kept consistent, and therefore only one classification result is obtained after the time-frequency graphs are input to a convolutional neural network in parallel;
2) to SiFrequency of (D) is normalized by Mel scale, Mi=mel(Si);
3) To MiTaking logarithm of the coefficient to obtain a time-frequency diagram Fi=todb(Mi);
4) Time-frequency diagram FiActually, the music file is a two-dimensional matrix, the rows of the matrix represent Mel frequency serial numbers, the columns of the matrix correspond to the music proceeding time, the data of the matrix are stored in a file form for further processing, and for a single music file, n time-frequency diagram files with different scales and corresponding time-frequency diagram files exist.
3. The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network of claim 1, characterized in that the maximum and minimum matrices $M_{max}$ and $M_{min}$ in step 2 are computed as follows:

$$M_{max}(p,q) = \max_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

$$M_{min}(p,q) = \min_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

where $M_{max}$ and $M_{min}$ store, in matrix form, the maximum and minimum values at every small-image pixel position (p, q) over the training, validation and test data sets.
4. The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network of claim 1, characterized in that the operation of the BN layer in step 3 is expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
CN202110912362.8A 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network Withdrawn CN113627327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110912362.8A CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110912362.8A CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Publications (1)

Publication Number Publication Date
CN113627327A 2021-11-09

Family

ID=78383894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110912362.8A Withdrawn CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Country Status (1)

Country Link
CN (1) CN113627327A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115856768A (en) * 2023-03-01 2023-03-28 深圳泽惠通通讯技术有限公司 DOA (direction of arrival) estimation method and system based on convolutional neural network


Similar Documents

Publication Publication Date Title
CN108198574B (en) Sound change detection method and device
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN111400540B (en) Singing voice detection method based on extrusion and excitation residual error network
CN113380255B (en) Voiceprint recognition poisoning sample generation method based on transfer training
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
Wei et al. A method of underwater acoustic signal classification based on deep neural network
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
CN114579743A (en) Attention-based text classification method and device and computer readable medium
Wang et al. Densely connected convolutional network for audio spoofing detection
CN113436646B (en) Camouflage voice detection method adopting combined features and random forest
CN113627327A (en) Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network
CN113450806B (en) Training method of voice detection model, and related method, device and equipment
CN114398611A (en) Bimodal identity authentication method, device and storage medium
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN113239809A (en) Underwater sound target identification method based on multi-scale sparse SRU classification model
CN114022706A (en) Method, device and equipment for optimizing image classification model and storage medium
CN113362814A (en) Voice identification model compression method fusing combined model information
CN117649621A (en) Fake video detection method, device and equipment
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
Qais et al. Deepfake audio detection with neural networks using audio features
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
CN115064175A (en) Speaker recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20211109)