CN113627327A - Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network

Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network

Info

Publication number
CN113627327A
Authority
CN
China
Prior art keywords
time, data, frequency, singing voice, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110912362.8A
Other languages
Chinese (zh)
Inventor
桂文明 (Gui Wenming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology filed Critical Jinling Institute of Technology
Priority to CN202110912362.8A priority Critical patent/CN113627327A/en
Publication of CN113627327A publication Critical patent/CN113627327A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Abstract

The invention discloses a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network. In a typical convolutional-neural-network singing voice detection algorithm, the network input layer is a single two-dimensional time-frequency matrix. Exploiting the multi-scale character of music signals, the method generates several two-dimensional time-frequency matrices at different scales by adjusting the window length of the short-time Fourier transform, and then feeds these time-frequency graphs to the convolutional neural network as parallel channels, so that the receptive fields of the network's neurons observe information of the music signal at several scales simultaneously. This strengthens the neurons' ability to extract and resolve time-frequency features and improves the overall performance of singing voice detection.

Description

Singing voice detection method based on multi-scale time-frequency graph parallel input convolutional neural network
Technical Field
The invention relates to the technical field of music artificial intelligence, and in particular to a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network.
Background
Regarding the background art of singing voice detection, the applicant has previously described a singing voice detection method based on a squeeze-and-excitation residual network (application No. CN202010164594.5) and a singing voice detection method based on a dot-product self-attention convolutional neural network (patent No. ZL202110192300.4). Singing voice detection (SVD) is the process of deciding whether each short segment of audio in digital music contains singing voice; the detection granularity is typically between 50 and 200 milliseconds. Singing voice detection is important fundamental work in the field of music information retrieval (MIR): many other research directions, such as singer identification, singing voice separation and lyric alignment, require it as a prerequisite or enhancement technology. Besides singing voice, music generally also contains instrument sounds, and although it is easy for a person to judge whether a piece mixing instruments and singing contains voice, this remains a challenging task for a machine.
The singing voice detection process generally comprises preprocessing, feature extraction, classification and post-processing, of which feature extraction and classification are the two most important steps. In feature extraction, the simplest and most common feature is the time-frequency graph obtained by the short-time Fourier transform; its variants include the mel time-frequency graph and the logarithmic mel time-frequency graph. Other features are typically derived from the time-frequency graph, such as mel-frequency cepstral coefficients (MFCC), fluctogram features, spectral flatness and spectral contrast. For classification, the main methods divide into those based on traditional classifiers and those based on deep neural networks (DNN): the former include the support vector machine (SVM), the hidden Markov model (HMM) and the random forest (RF), while the latter include methods using convolutional neural networks (CNN) and recurrent neural networks (RNN).
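As a concrete illustration of these standard features (not part of the claimed invention; the file name and parameter values are illustrative assumptions), the following Python sketch computes several of them with the librosa library:

    import librosa
    import numpy as np

    # Load a music file (path and sample rate are illustrative assumptions).
    x, sr = librosa.load("song.mp3", sr=22050, mono=True)

    # Time-frequency graph: magnitude of the short-time Fourier transform.
    S = np.abs(librosa.stft(x, n_fft=1024))

    mel = librosa.feature.melspectrogram(S=S**2, sr=sr)        # mel time-frequency graph
    log_mel = librosa.power_to_db(mel)                         # logarithmic mel graph
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)         # mel-frequency cepstral coefficients
    flatness = librosa.feature.spectral_flatness(S=S)          # spectral flatness
    contrast = librosa.feature.spectral_contrast(S=S, sr=sr)   # spectral contrast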
Aiming at the singing voice detection problem, the applicant previously filed a singing voice detection method based on a squeeze-and-excitation residual network (application No. CN202010164594.5). The method comprises the following steps: construct squeeze-and-excitation residual networks; construct a music data set; convert the music data set into an image set; train the constructed networks with the training image set; test the trained networks with the test image set; select the network with the highest test accuracy as the final singing voice detection network; and use the selected network to perform singing voice detection on the audio file under test. Singing voice features of different levels are implicitly extracted by the deep residual network, and the adaptive attention characteristic of the embedded squeeze-and-excitation modules judges the importance of the features. On the JMD data set, with network depths of 14, 18, 34, 50, 101, 512 and 200 respectively, the average detection accuracy is 88.19, so the effect still leaves room for improvement. In addition, the network stacking approach consumes considerable computing resources and the training time is long.
To address the singing voice detection problem, the applicant also filed a singing voice detection method based on a dot-product self-attention convolutional neural network (patent No. ZL202110192300.4). In that method a dot-product self-attention module is embedded in the convolutional neural network: after each of two convolution-group modules, a dot-product self-attention module re-estimates attention weights over the features output by that module, and the re-weighted feature map is passed to the next layer of the network. The attention paid to the features learned by the convolutional network is thus no longer uniform; this re-estimation mechanism lets the network treat features differently and improves overall performance. In addition, the dot-product self-attention module improves on the traditional dot-product self-attention model used in machine translation: first, the key-value pair <k, v> and the query vector q may have unequal lengths; second, the meanings of q, k and v are redefined; third, an attention-distribution transformation mechanism is added.
The invention considers improving detection performance by improving the network input layer of a CNN-based singing voice detection algorithm. In a typical CNN-based singing voice detection algorithm, the network input layer is a single time-frequency matrix, obtained by windowing the music signal with a window of a certain length and applying the Fourier transform, i.e. a time-frequency graph at one scale. Although a time-frequency graph at one scale extracts typical features of the original signal and may suffice for some analyses, it retains information at only that one scale, and some problems need information at more scales, since multi-scale information is more conducive to analysis. The essence of the short-time Fourier transform is matching the signal intercepted by a window function against cosine bases; when the window length matches the signal well, the signal is represented more accurately. Therefore, when a single-scale time-frequency graph cannot meet the analysis requirement, a multi-scale time-frequency graph is proposed, which is more favorable to analyzing the signal. Fig. 1 shows the time-frequency graphs of one song at two different scales: the graph at scale 2048 (lower) is clearer than the graph at scale 512 (upper), which shows that at scale 2048 the time-frequency graph expresses this song's information more accurately; integrating the information of both scales is obviously even more beneficial to signal analysis. According to this principle, the invention first generates several two-dimensional time-frequency matrices at different scales by adjusting the window length of the short-time Fourier transform, and then feeds these time-frequency graphs into the convolutional neural network as parallel channels, so that the neuron receptive fields of the network observe information of the music signal at several scales simultaneously, thereby strengthening the neurons' ability to extract and resolve time-frequency features and improving the overall performance of singing voice detection.
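The window-length trade-off behind this multi-scale idea can be seen in a minimal sketch (the synthetic signal and the window lengths 512 and 2048 are illustrative assumptions): the shorter window yields more frames but fewer frequency bins, i.e. finer time resolution and coarser frequency resolution, and vice versa for the longer window.

    import numpy as np
    import librosa

    sr = 22050
    t = np.arange(2 * sr) / sr
    x = np.sin(2 * np.pi * 440 * t)      # a synthetic test tone

    for w in (512, 2048):                # two window lengths = two scales
        S = np.abs(librosa.stft(x, n_fft=w, hop_length=w // 4))
        bins, frames = S.shape
        print(f"window {w}: {bins} frequency bins x {frames} frames")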
Disclosure of Invention
The purpose of the invention: aiming at the defects of the prior art, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network. On one hand, through the parallel multi-channel multi-scale time-frequency-graph input, the neural network observes information of the music signal at several scales simultaneously, which adjusts its resolving power over the high-frequency and low-frequency parts, extracts the characteristics of the singing voice accurately, and improves the overall performance of singing voice detection; on the other hand, the multi-channel data correspond to one music signal with a single classification, which amounts to lateral data augmentation and also helps improve detection accuracy.
The technical scheme is as follows: to achieve the above purpose, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, comprising the following specific steps:
step 1: for single music filePerforming short-time Fourier transform by different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data (a preprocessing sketch is given after step 8 below);
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$.
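The preprocessing sketch referenced in step 2 above illustrates sub-steps 2) to 7) for one time-frequency-graph file. All names and parameter values (h, w, hop) are illustrative assumptions, and the matrices M_max, M_min and the per-channel statistics u, σ (arrays of 3 equal values) are assumed to have already been accumulated over the whole data set as described:

    import numpy as np

    def slice_tfg(F, w=115, hop=50):
        # 2) Slice a time-frequency graph F (h rows x time columns) into h x w images.
        h, total = F.shape
        slices = []
        start = 0
        while start < total:
            patch = F[:, start:start + w]
            if patch.shape[1] < w:                       # zero-pad the last slice
                patch = np.pad(patch, ((0, 0), (0, w - patch.shape[1])))
            slices.append(patch)                         # single-channel small image
            if start + w >= total:
                break
            start += hop                                 # hop < w, so slices overlap
        return slices

    def preprocess(patch, M_max, M_min, u, sigma):
        # 4) element-wise max-min normalization with the data-set matrices
        norm = (patch - M_min) / (M_max - M_min + 1e-12)
        # 5) three-channel grayscale image with values between 0 and 255
        gray = np.repeat((norm * 255.0)[None, :, :], 3, axis=0)
        # 7) per-channel standardization with the data-set statistics u, sigma
        return (gray - u[:, None, None]) / sigma[:, None, None]

    # A multi-scale combination stacks the n processed slices channel-wise,
    # giving a (3*n, h, w) input for the network of step 3:
    # combo = np.concatenate([preprocess(s, M_max, M_min, u, sigma)
    #                         for s in per_scale_slices], axis=0)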
Step 3: construct a singing voice detection network based on a convolutional neural network with n-scale small-image input, the number of input channels being 3×n;
the structure diagram of the convolutional neural network comprises four components:
the first part is an input layer, where the input layer has 3 × n input channels;
the second part and the third part have the same structure and are channel attention convolutional layers which respectively consist of 2 BN convolutional blocks, 1 maximum value pooling layer and 1 SEBlock channel attention layer;
the structure of the BN convolution block and the SEColck is characterized in that the BN convolution block consists of 1 3 multiplied by 3 convolution, 1 BatchNorm layer and a Relu unit; SEBlock is a squeezing and exciting module, assuming that the convolution output F of the previous layer is a picture with the height and width of h multiplied by w, the number of channels is c, squeezing operation is a global tie pooling layer, and c channels are compressed into c descriptors; the first step of the excitation operation is a door mechanism, and specifically comprises that a first full-connection layer reduces dimensions of c descriptors by r times, then a Relu function is used for carrying out nonlinear transformation, and then a second full-connection layer multiplies the dimensions by r; the second step of excitation operation is that firstly, a Sigmod activation function is used for carrying out weight estimation on channels, then, the channels are adjusted according to the weight estimation through Scale operation, finally, the adjusted channels F' enter a next layer of network, SEBlock enables the action of the channels on the next layer of network to be changed, the weights are not equal any more, but are obtained through learning, and the process is essentially a learning and distributing process of channel attention; the fourth part is a feature vector extraction layer which comprises 3 full-connection layers and 2 Dropout layers, wherein the full-connection layers store high-level information extracted by the previous convolutional layer, the dimension is further reduced in a feature vector mode, finally output one-dimensional data determines whether singing voice segments corresponding to the input n-scale time-frequency graphs contain singing voice or not, the output one-dimensional data is converted into probability values by a Sigmod function, and then the loss of training is calculated by a weighted binary cross entropy loss function;
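A minimal PyTorch sketch of the four-part network described above follows. It is an illustrative reading of the description, not the exact network of the invention: the channel widths, the reduction factor r and n = 3 scales (9 input channels) are assumptions.

    import torch
    import torch.nn as nn

    class BNConvBlock(nn.Module):
        # One 3x3 convolution + BatchNorm + ReLU.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        def forward(self, x):
            return self.block(x)

    class SEBlock(nn.Module):
        # Squeeze-and-excitation: global average pool, gate, channel re-scale.
        def __init__(self, c, r=16):
            super().__init__()
            self.squeeze = nn.AdaptiveAvgPool2d(1)             # c channels -> c descriptors
            self.excite = nn.Sequential(
                nn.Linear(c, c // r), nn.ReLU(inplace=True),   # reduce dimension by r
                nn.Linear(c // r, c), nn.Sigmoid(),            # restore; weights in (0,1)
            )
        def forward(self, x):
            b, c, _, _ = x.shape
            w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                       # Scale operation

    class SVDNet(nn.Module):
        def __init__(self, n_scales=3):
            super().__init__()
            c_in = 3 * n_scales                                # part 1: input layer
            def part(cin, cout):                               # parts 2 and 3
                return nn.Sequential(BNConvBlock(cin, cout), BNConvBlock(cout, cout),
                                     nn.MaxPool2d(2), SEBlock(cout))
            self.features = nn.Sequential(part(c_in, 32), part(32, 64))
            self.head = nn.Sequential(                         # part 4: 3 FC + 2 Dropout
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(256, 64), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(64, 1),                              # one-dimensional output
            )
        def forward(self, x):
            return self.head(self.features(x))                 # logits; apply Sigmoid

    # Usage sketch: model = SVDNet(n_scales=3); logits = model(torch.randn(8, 9, 80, 115))
    # Training would pair the Sigmoid with a weighted binary cross-entropy, e.g.
    # loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.5]))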
and 4, step 4: training and testing, and counting the evaluation result;
1) From the training-set small-image combinations $\hat{G}_j$ obtained in step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From the validation-set small-image combinations $\hat{G}_k$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1);
3) From the test-set small-image combinations $\hat{G}_l$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn;
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test;
If the prediction is singing voice, the result is called positive (P); if not, negative (N). By comparison with the singing voice labels in the data set, a correct prediction is marked T and a wrong one F, so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A, precision P, recall R and F-value are computed separately, where the F-value integrates precision P and recall R (a small evaluation sketch follows these formulas):

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
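A small sketch of this per-song evaluation, assuming binary predictions and labels are available as numpy arrays (all names are illustrative):

    import numpy as np

    def song_metrics(pred, label):
        # pred, label: 0/1 arrays over one song's small-image combinations
        o_tp = int(np.sum((pred == 1) & (label == 1)))   # correct positives
        o_tn = int(np.sum((pred == 0) & (label == 0)))   # correct negatives
        o_fp = int(np.sum((pred == 1) & (label == 0)))   # false alarms
        o_fn = int(np.sum((pred == 0) & (label == 1)))   # misses
        a = (o_tp + o_tn) / (o_tp + o_tn + o_fp + o_fn)
        p = o_tp / (o_tp + o_fp) if o_tp + o_fp else 0.0
        r = o_tp / (o_tp + o_fn) if o_tp + o_fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return a, p, r, f

    # The test result averages each index over all songs:
    # A, P, R, F = np.mean([song_metrics(p, y) for p, y in songs], axis=0)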
as a further improvement of the invention, the time-frequency diagram calculation process in step 1 comprises:
1) Set the window lengths $w_i, i \in [1..n]$ and compute the short-time Fourier transform $S_i = stft(x, w_i)$ of the music file x. When computing the short-time Fourier transform, half the window length is padded on each side of the data sequence of x, so that the frame indices of the time-frequency graphs at the different scales correspond to the same times; the singing voice annotation times of the time-frequency graphs then stay consistent, and the parallel input of the time-frequency graphs to the convolutional neural network yields a single classification result;
2) Normalize the frequencies of $S_i$ to the mel scale: $M_i = mel(S_i)$;
3) Take the logarithm of the coefficients of $M_i$ to obtain the time-frequency graph $F_i = todb(M_i)$;
4) The time-frequency graph $F_i$ is in fact a two-dimensional matrix whose rows represent mel-frequency indices and whose columns correspond to the time at which the music proceeds. The matrix data is saved as a file for further processing; for a single music file there are n corresponding time-frequency-graph files at different scales.
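A sketch of this step-1 pipeline under the assumption that librosa is used: librosa.stft with center=True pads half the window length on each side of the signal, so frame k of every scale is centered on the same sample, as required above. The window lengths, hop length and mel parameters are illustrative assumptions:

    import librosa
    import numpy as np

    def time_frequency_graphs(path, window_lengths=(512, 1024, 2048)):
        x, sr = librosa.load(path, sr=22050, mono=True)
        graphs = []
        for w in window_lengths:
            # 1) STFT; center=True pads w//2 samples on both sides of x, so
            #    frame indices align in time across all scales (same hop length).
            S = np.abs(librosa.stft(x, n_fft=w, hop_length=315, center=True))
            # 2) normalize the frequencies to the mel scale
            M = librosa.feature.melspectrogram(S=S**2, sr=sr, n_mels=80)
            # 3) take the logarithm of the coefficients (dB)
            F = librosa.power_to_db(M)
            # 4) F: rows = mel indices, columns = time frames; one file per scale
            np.save(f"{path}.w{w}.npy", F)
            graphs.append(F)
        return graphs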
As a further improvement of the invention, the maximum and minimum matrices $M_{max}$ and $M_{min}$ in step 2 are computed as follows:

$$M_{max}(p,q) = \max_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

$$M_{min}(p,q) = \min_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

where $M_{max}$ and $M_{min}$ store, in matrix form, the maximum and minimum values at every small-image pixel position (p, q) over the training, validation and test data sets.
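A sketch of accumulating these pixel-level matrices over all small images of the training, validation and test sets, assuming every small image has the same h × w shape:

    import numpy as np

    def minmax_matrices(small_images):
        # small_images: iterable of h x w arrays drawn from all three data sets
        it = iter(small_images)
        first = next(it)
        m_max, m_min = first.copy(), first.copy()
        for img in it:
            np.maximum(m_max, img, out=m_max)   # element-wise running maximum
            np.minimum(m_min, img, out=m_min)   # element-wise running minimum
        return m_max, m_min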
As a further improvement of the invention, the operation of the BN layer in step 3 is expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
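A numpy sketch of formulas (1) to (4) for one feature over a batch, where ε is a small constant for numerical stability and γ, β stand in for the trainable parameters:

    import numpy as np

    def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
        # x: 1-D array with one feature's values over a batch of m samples
        mu = x.mean()                            # (1) batch mean
        var = ((x - mu) ** 2).mean()             # (2) batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # (3) standardization
        return gamma * x_hat + beta              # (4) output z_i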
The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network has the following beneficial effects:
1) In experiments on the public data set Jamendo (JMD for short), the evaluation results of the algorithm of this application are better than those of the traditional CNN method and of the algorithms of patent No. ZL202110192300.4 and application No. CN202010164594.5. In the experiments, the training, validation and test split of JMD was kept the same across every compared algorithm; each algorithm was run 3 times, and the averages of the percentages of the two indices, accuracy A and F-value, were taken as the evaluation result. The algorithm of application No. CN202010164594.5 was run 3 times at each of the depths 14, 18, 34, 50, 101, 512 and 200 and its A and F results averaged. The results are shown in Table 1 below:
Table 1 Comparison of the results of the algorithm of this application and other algorithms
[Table 1 is provided as an image in the original publication.]
The algorithm proposed in this application is higher than the traditional CNN method by 3.91 and 3.25 percentage points on the indices A and F respectively, higher than patent CN202010164594.5 by 2.26 and 1.72 percentage points, and higher than patent ZL202110192300.4 by 2.09 and 1.89 percentage points.
2) This application provides a singing voice detection method based on a convolutional neural network with parallel multi-scale time-frequency-graph input. Through the parallel multi-channel multi-scale input, on one hand the neural network observes information of the music signal at several scales simultaneously, so the network can adjust its resolving power over the high-frequency and low-frequency parts, extract the characteristics of the singing voice accurately, and improve the overall performance of singing voice detection; on the other hand, the multi-channel data correspond to one music signal with a single classification, which amounts to lateral data augmentation and also helps improve detection accuracy. This multi-channel input mode is completely different from the applicant's two prior patents ZL202110192300.4 and CN202010164594.5 and improves the overall performance.
3) The time-frequency graph of each scale is converted into a three-channel grayscale image: the converted small image changes from one original channel into standard three-channel image data with values between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive image representation simulating the naked eye, and the two added channels increase the dimensionality of the features, which makes feature extraction by the neural network easier.
4) Before the small-image data of the training, validation and test data sets are converted into three-channel grayscale images, global normalization is performed. The global normalization adopts the max-min method, and the maximum and minimum values used are pixel-level values rather than a single aggregate maximum and minimum over all pixels. This normalization normalizes the data while preserving its pixel-level character, which helps improve the overall effect of singing voice detection. Experiments on the public data set JMD under identical conditions compared this global normalization with no normalization and with normalization using the aggregate max-min method: the indices A and F of this method are higher than those of the aggregate max-min method by 0.92 and 0.81 percentage points respectively, and higher than those obtained without normalization by 2.08 and 2.04 percentage points respectively. The experimental data are tabulated below:
Table 2 Comparison of the evaluation results of the normalization methods
[Table 2 is provided as an image in the original publication.]
5) In the neural network design, BN (BatchNorm) layers are placed in the second, third and fourth parts. On one hand this addresses the internal covariate shift problem, so a larger learning rate can be used during training to accelerate convergence; on the other hand BN alleviates vanishing and exploding gradients; moreover, BN increases generalization ability to some extent. The added BN layers improve the performance of the neural network and the accuracy of singing voice detection.
6) The BN layers cooperate with the SEBlock channel-attention re-estimation mechanism to make the overall effect of the neural network better. Compared with the neural network of patent No. ZL202110192300.4, the main differences are the multi-scale multi-channel parallel input, the BN layers and SEBlock. Compared with the patent of application No. CN202010164594.5, besides the two main differences of the multi-scale multi-channel parallel input and the BN layers, this application also adopts SEBlock but does not stack SEBlock networks. Stacking SEBlocks can improve detection accuracy to a certain extent, but it makes training long and inefficient: the 200-layer network of that patent takes 103 minutes to train one round under JMD, while one round of this application needs only 11 minutes, i.e. its training time is 9.4 times that of this application. Even with the shorter training time, the accuracy and F-value of this method are still higher by 2.26 and 1.72 percentage points respectively.
Drawings
FIG. 1 shows the time-frequency graphs of one song at two scales;
FIG. 2 shows the BN convolution block structure and the SEBlock structure;
FIG. 3 is a schematic diagram of singing voice detection.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention. It should be noted that, as used in the following description, the terms "front", "rear", "left", "right", "upper" and "lower" refer to directions in the drawings, and the terms "inner" and "outer" refer to directions toward and away from, respectively, the geometric center of a particular component.
As a specific embodiment, the invention provides a singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, comprising the following specific steps:
step 1: short-time Fourier transform of a single music file through different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
the time-frequency diagram calculation process comprises the following steps:
1) Set the window lengths $w_i, i \in [1..n]$ and compute the short-time Fourier transform $S_i = stft(x, w_i)$ of the music file x. When computing the short-time Fourier transform, half the window length is padded on each side of the data sequence of x, so that the frame indices of the time-frequency graphs at the different scales correspond to the same times; the singing voice annotation times of the time-frequency graphs then stay consistent, and the parallel input of the time-frequency graphs to the convolutional neural network yields a single classification result;
2) Normalize the frequencies of $S_i$ to the mel scale: $M_i = mel(S_i)$;
3) Take the logarithm of the coefficients of $M_i$ to obtain the time-frequency graph $F_i = todb(M_i)$;
4) The time-frequency graph $F_i$ is in fact a two-dimensional matrix whose rows represent mel-frequency indices and whose columns correspond to the time at which the music proceeds. The matrix data is saved as a file for further processing. For a single music file there are n corresponding time-frequency-graph files at different scales.
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data;
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$.
Step 3: construct a singing voice detection network based on a convolutional neural network with input channels for the n-scale time-frequency graphs;
the structure diagram of the convolutional neural network constructed by the invention is shown in fig. 2, and comprises four components: the first part is an input layer, where the input layer has 3 × n input channels; the second part and the third part have the same structure and are channel attention convolutional layers which respectively consist of 2 BN convolutional blocks, 1 maximum value pooling layer and 1 SEBlock channel attention layer; the structure of the BN convolution block and the SEColck is characterized in that the BN convolution block consists of 1 3 multiplied by 3 convolution, 1 BatchNorm layer and a Relu unit; the SEBLICK is a squeezing and exciting module, assuming that the convolution output F of the previous layer is a picture with the height and width of h multiplied by w, the number of channels is c, the squeezing operation is a global tie pooling layer, and c channels are compressed into c descriptors; the first step of the excitation operation is a door mechanism, and specifically comprises that a first full-connection layer reduces dimensions of c descriptors by r times, then a Relu function is used for carrying out nonlinear transformation, and then a second full-connection layer multiplies the dimensions by r; the second step of excitation operation is that firstly, a Sigmod activation function is used for carrying out weight estimation on channels, then, the channels are adjusted according to the weight estimation through Scale operation, finally, the adjusted channels F' enter a next layer of network, SEBlock enables the action of the channels on the next layer of network to be changed, the weights are not equal any more, but are obtained through learning, and the process is essentially a learning and distributing process of channel attention; the fourth part is a feature vector extraction layer which comprises 3 full-connection layers and 2 Dropout layers, wherein the full-connection layers store high-level information extracted by the previous convolutional layer, the dimension is further reduced in a feature vector mode, finally output one-dimensional data determines whether singing voice segments corresponding to the input n-scale time-frequency graphs contain singing voice or not, the output one-dimensional data is converted into probability values by a Sigmod function, and then the loss of training is calculated by a weighted binary cross entropy loss function;
in the neural network design, BN (BatchNorm) layers are designed in the second part, the third part and the fourth part, so that on one hand, the problem of Internal Covariate Shift is solved, and a larger learning rate can be adopted in the training process to accelerate convergence; BN, on the other hand, alleviates the problems of gradient disappearance and gradient explosion; moreover, BN increases the generalization ability to some extent. Due to the addition of the BN layer, the performance of the neural network is improved, and the accuracy of singing voice detection is improved. The core idea of the BN layer is that data output by each layer of the neural network is normalized into standard normal distribution to solve the problem of the Internal Covariate Shift, and at the moment, because the normalization can weaken the expression capability of the network to a certain extent, two trainable parameters are added to enhance the expression capability of the network.
The operation of the BN layer can be expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
Step 4: train and test, and compile the evaluation results.
1) From all multi-channel small-image combinations $\hat{G}_j$ of the training data set of step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From all multi-channel small-image combinations $\hat{G}_k$ of the validation data set of step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1).
3) From all multi-channel small-image combinations $\hat{G}_l$ of the test data set of step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn.
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test. If the prediction is singing voice, we call the result positive P (positive); if not, negative N (negative). Comparing the prediction with the singing voice labels in the data set, a correct prediction is marked T (true) and a wrong one F (false), so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A (accuracy), precision P (precision), recall R (recall) and F-value (F-measure) are computed separately, where the F-value integrates precision and recall:

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
the above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A singing voice detection method based on a multi-scale time-frequency graph parallel input convolutional neural network, characterized by comprising the following specific steps:
step 1: short-time Fourier transform of a single music file through different window lengths wi,i∈[1..n]To obtain time-frequency graphs F with different scalesi,i∈[1..n]And stored in the form of n data files;
Step 2: set up training, validation and test data sets, each containing the singing voice annotation information of its music;
1) Perform the short-time Fourier transform of step 1 on every music file in each data set to obtain time-frequency-graph files at n scales; if the data sets contain m music files in total, m×n time-frequency-graph files are generated;
2) Perform matrix slicing on the time-frequency-graph files of the training, validation and test data sets along the time axis. The number of rows of a slice matrix is kept the same as that of the time-frequency-graph file, and each slice matrix corresponds to one small image whose height and width are set to h and w. To preserve the continuity of the data, the slice matrices overlap somewhat, so the slicing interval hop is smaller than the matrix width; the last matrix of a time-frequency-graph file, whose width is less than w, is zero-padded. The sliced small images are ordered and numbered per music file, and all small images of the training, validation and test sets are denoted $T_{i,j}$, $V_{i,k}$ and $U_{i,l}$ respectively, where i is the scale index and j, k, l are the small-image indices within the training, validation and test data sets. The parameters h, w and hop are kept the same when slicing the time-frequency files of the same music at different scales, so small images at different scales correspond to the same time points, and the combination of the small images of all scales at one time point is denoted

$$G_j = (T_{1,j}, T_{2,j}, \dots, T_{n,j}),$$

and likewise $G_k$ and $G_l$ for the validation and test sets, where each small image is single-channel data;
3) Compute the element-wise maximum and minimum over all small-image data $T_{i,j}$, $V_{i,k}$, $U_{i,l}$ in the training, validation and test data sets, and store them in matrices $M_{max}$ and $M_{min}$ as the parameters for the normalization of the small-image data;
4) With $M_{max}$ and $M_{min}$ as parameters, perform max-min normalization on all small images, obtaining the normalized combinations $G'_j$, $G'_k$ and $G'_l$;
5) Convert the small-image combinations $G'_j$, $G'_k$ and $G'_l$ into three-channel grayscale images whose values lie between 0 and 255. Although the three channels of a grayscale image carry the same data, a three-channel grayscale image is a more intuitive representation simulating what the naked eye sees, and the two added channels increase the dimensionality of the features, which to a certain extent makes feature extraction by the neural network easier. The converted combinations are denoted $G''_j$, $G''_k$ and $G''_l$, where each small image is three-channel data;
6) Compute the mean and variance of all small-image data in $G''_j$, $G''_k$ and $G''_l$. Here the mean and variance summarize all small-image data per channel, unlike the matrices of step 3): each channel is summarized by a single mean and variance, and since the three channels are identical the mean and the variance each consist of 3 equal values, denoted u and σ respectively;
7) Standardize $G''_j$, $G''_k$ and $G''_l$ with the parameters u and σ, converting them into the small-image combinations $\hat{G}_j$, $\hat{G}_k$ and $\hat{G}_l$ that are input to the convolutional neural network;
8) According to the singing voice annotation of the music, compute the label $y_j$, $y_k$ or $y_l$ corresponding to each multi-scale multi-channel small-image combination $\hat{G}_j$, $\hat{G}_k$ or $\hat{G}_l$;
Step 3: construct a singing voice detection network based on a convolutional neural network with n-scale small-image input, the number of input channels being 3×n;
The convolutional neural network comprises four parts:
The first part is the input layer, which has 3×n input channels;
The second and third parts have the same structure: each is a channel-attention convolutional layer consisting of 2 BN convolution blocks, 1 max-pooling layer and 1 SEBlock channel-attention layer;
The structures of the BN convolution block and of SEBlock are as follows. A BN convolution block consists of one 3×3 convolution, one BatchNorm layer and one ReLU unit. SEBlock is a squeeze-and-excitation module: assuming the convolution output F of the previous layer is a picture of height and width h×w with c channels, the squeeze operation is a global average pooling layer that compresses the c channels into c descriptors. The first step of the excitation operation is a gate mechanism: a first fully connected layer reduces the dimension of the c descriptors by a factor of r, a ReLU function then applies a nonlinear transformation, and a second fully connected layer expands the dimension back by the factor r. The second step of the excitation operation first estimates a weight for each channel with a Sigmoid activation function and then adjusts the channels according to these weights through a Scale operation; finally the adjusted channels F' enter the next layer of the network. SEBlock changes the effect of the channels on the next layer: their weights are no longer equal but are obtained by learning, and the process is essentially one of learning and allocating channel attention. The fourth part is a feature-vector extraction layer comprising 3 fully connected layers and 2 Dropout layers. The fully connected layers preserve the high-level information extracted by the preceding convolutional layers and further reduce its dimension in the form of a feature vector; the final one-dimensional output decides whether the segment corresponding to the input n-scale time-frequency graphs contains singing voice. The one-dimensional output is converted into a probability value by a Sigmoid function, and the training loss is then computed with a weighted binary cross-entropy loss function;
and 4, step 4: training and testing, and counting the evaluation result;
1) From the training-set small-image combinations $\hat{G}_j$ obtained in step 2, randomly extract a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for training. After one batch has been trained, randomly extract another b combinations from the remaining data, until the data of the whole training set has been drawn, completing one round of training. If the number of training rounds reaches the set limit, stop training and proceed to testing;
2) From the validation-set small-image combinations $\hat{G}_k$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for validation, obtaining the predictions for that batch. After one batch has been validated, take the next b combinations from the remaining data in sequence, until the whole validation set has been drawn, completing one validation pass. Each validation pass yields the accuracy of the predictions; if the accuracy does not improve for e consecutive passes, stop training, otherwise continue training with step 1);
3) From the test-set small-image combinations $\hat{G}_l$ obtained in step 2, take out in sequence a batch of b combinations $\hat{G}_s, s \in [1..b]$ together with the corresponding labels $y_s$, and input them into the neural network of step 3 for testing, obtaining the predictions for that batch. After one batch has been tested, take the next b combinations from the remaining data in sequence, until the data of the whole test set has been drawn;
4) After testing, first compute the singing voice detection evaluation indices of each song, then take the mean of the indices over all songs as the evaluation result of the test;
If the prediction is singing voice, the result is called positive (P); if not, negative (N). By comparison with the singing voice labels in the data set, a correct prediction is marked T and a wrong one F, so the prediction results are counted in four sample numbers $O_{tp}$, $O_{tn}$, $O_{fp}$, $O_{fn}$:
$O_{tp}$: total number of samples predicted positive P with the prediction correct T;
$O_{tn}$: total number of samples predicted negative N with the prediction correct T;
$O_{fp}$: total number of samples predicted positive P with the prediction wrong F, i.e. the false alarms;
$O_{fn}$: total number of samples predicted negative N with the prediction wrong F, i.e. the misses;
For each song, the accuracy A, precision P, recall R and F-value are computed separately, where the F-value integrates precision P and recall R:

$$A = \frac{O_{tp} + O_{tn}}{O_{tp} + O_{tn} + O_{fp} + O_{fn}}$$

$$P = \frac{O_{tp}}{O_{tp} + O_{fp}}$$

$$R = \frac{O_{tp}}{O_{tp} + O_{fn}}$$

$$F = \frac{2 \times P \times R}{P + R}$$
2. the singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network as claimed in claim 1, characterized in that: the time-frequency diagram calculation process in the step 1 comprises the following steps:
1) setting the Window Length wi,i∈[1..n]Calculating a short-time Fourier transform S for the music file xi=stft(x,wi) When short-time Fourier transform is calculated, half of the window length is respectively filled on the left side and the right side of the data sequence of x, so that the time of the frame numbers corresponding to the time-frequency graphs with different scales is consistent, the singing voice labeling time corresponding to the time-frequency graphs is kept consistent, and therefore only one classification result is obtained after the time-frequency graphs are input to a convolutional neural network in parallel;
2) to SiFrequency of (D) is normalized by Mel scale, Mi=mel(Si);
3) To MiTaking logarithm of the coefficient to obtain a time-frequency diagram Fi=todb(Mi);
4) Time-frequency diagram FiActually, the music file is a two-dimensional matrix, the rows of the matrix represent Mel frequency serial numbers, the columns of the matrix correspond to the music proceeding time, the data of the matrix are stored in a file form for further processing, and for a single music file, n time-frequency diagram files with different scales and corresponding time-frequency diagram files exist.
3. The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network of claim 1, characterized in that the maximum and minimum matrices $M_{max}$ and $M_{min}$ in step 2 are computed as follows:

$$M_{max}(p,q) = \max_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

$$M_{min}(p,q) = \min_{i,j,k,l}\,\{\,T_{i,j}(p,q),\ V_{i,k}(p,q),\ U_{i,l}(p,q)\,\}$$

where $M_{max}$ and $M_{min}$ store, in matrix form, the maximum and minimum values at every small-image pixel position (p, q) over the training, validation and test data sets.
4. The singing voice detection method based on the multi-scale time-frequency graph parallel input convolutional neural network of claim 1, characterized in that the operation of the BN layer in step 3 is expressed by the following formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$z_i = \gamma\,\hat{x}_i + \beta \qquad (4)$$

where $x_i$ is the input of the BN layer; formula (1) computes the mean of the batch of m samples, formula (2) the variance, and formula (3) standardizes the samples; formula (4) adds two trainable parameters γ and β to enhance expressiveness, and $z_i$ is the output of the BN layer.
CN202110912362.8A 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network Withdrawn CN113627327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110912362.8A CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110912362.8A CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Publications (1)

Publication Number Publication Date
CN113627327A 2021-11-09

Family

ID=78383894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110912362.8A Withdrawn CN113627327A (en) 2021-08-10 2021-08-10 Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network

Country Status (1)

Country Link
CN (1) CN113627327A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115856768A (en) * 2023-03-01 2023-03-28 深圳泽惠通通讯技术有限公司 DOA (direction of arrival) estimation method and system based on convolutional neural network


Similar Documents

Publication Publication Date Title
CN108198574B (en) Sound change detection method and device
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN111400540B (en) Singing voice detection method based on extrusion and excitation residual error network
CN113380255B (en) Voiceprint recognition poisoning sample generation method based on transfer training
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
Wei et al. A method of underwater acoustic signal classification based on deep neural network
Khdier et al. Deep learning algorithms based voiceprint recognition system in noisy environment
CN114579743A (en) Attention-based text classification method and device and computer readable medium
Wang et al. Densely connected convolutional network for audio spoofing detection
CN113436646B (en) Camouflage voice detection method adopting combined features and random forest
CN113627327A (en) Singing voice detection method based on multi-scale time-frequency graph parallel input convolution neural network
CN113450806B (en) Training method of voice detection model, and related method, device and equipment
CN114398611A (en) Bimodal identity authentication method, device and storage medium
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN113239809A (en) Underwater sound target identification method based on multi-scale sparse SRU classification model
CN114022706A (en) Method, device and equipment for optimizing image classification model and storage medium
CN113362814A (en) Voice identification model compression method fusing combined model information
CN117649621A (en) Fake video detection method, device and equipment
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
Qais et al. Deepfake audio detection with neural networks using audio features
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
CN115064175A (en) Speaker recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20211109)