CN110599987A - Piano note recognition algorithm based on convolutional neural network - Google Patents
Piano note recognition algorithm based on convolutional neural network
- Publication number
- CN110599987A (application CN201910787062.4A)
- Authority
- CN
- China
- Prior art keywords
- note
- neural network
- piano
- short
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013527 convolutional neural network Methods 0.000 title claims abstract description 15
- 238000013528 artificial neural network Methods 0.000 claims abstract description 11
- 238000001514 detection method Methods 0.000 claims abstract description 11
- 230000005236 sound signal Effects 0.000 claims abstract description 9
- 239000011295 pitch Substances 0.000 claims description 14
- 238000012216 screening Methods 0.000 claims description 6
- 210000002569 neuron Anatomy 0.000 claims description 4
- 238000011176 pooling Methods 0.000 claims description 4
- 238000010606 normalization Methods 0.000 claims description 2
- 230000035772 mutation Effects 0.000 claims 1
- 238000000034 method Methods 0.000 abstract description 12
- 238000004364 calculation method Methods 0.000 abstract description 4
- 230000007547 defect Effects 0.000 abstract 1
- 238000005070 sampling Methods 0.000 description 6
- 238000005311 autocorrelation function Methods 0.000 description 5
- 238000009432 framing Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000001914 filtration Methods 0.000 description 3
- 230000011218 segmentation Effects 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 230000000737 periodic effect Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000002349 favourable effect Effects 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Auxiliary Devices For Music (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The invention discloses a piano note recognition algorithm based on a convolutional neural network, which mainly comprises the following steps: finding the starting point and end point of each note in a continuous piano recording through an endpoint detection algorithm; dividing the complete piano audio into a set of single-note audio files; drawing a spectrogram of each note; and inputting the spectrogram into a trained neural network to complete recognition. The invention provides an algorithm that searches for peak points of the short-time energy difference and combines them with double thresholds, overcoming the traditional double-threshold algorithm's excessive dependence on threshold settings. By drawing spectrograms, the audio signal is converted into digital images for recognition, which avoids the frequency-doubling errors produced when the fundamental frequency is extracted by traditional time-domain methods, and greatly improves calculation speed and accuracy compared with traditional frequency-domain methods.
Description
Technical Field
The invention belongs to an audio signal processing technology, and particularly relates to a piano note identification algorithm based on a convolutional neural network.
Background
With economic development and cultural growth, the number of music enthusiasts keeps increasing, but constrained by factors such as energy and time, a considerable proportion of them choose to self-study and practice in their spare time. Lacking professional guidance, they often play wrong notes without being able to judge this themselves; software that can automatically recognize the notes of a piano performance can help them to a great extent. At the same time, piano note recognition can reduce the workload of music professionals and benefit the intelligent processing and creation of music.
The piano note identification algorithm mainly comprises an end point detection part, a note segmentation part and a pitch identification part.
Endpoint detection and note segmentation are key steps before note recognition, and accurate endpoint detection is a precondition for accurate note recognition. The double-threshold algorithm is the most classical endpoint detection method. It sets high and low thresholds for the short-time energy (denoted δ1 and δ2) and for the short-time zero-crossing rate (denoted Z1 and Z2), dividing a complete audio file into four stages: 1. silent segment: short-time energy below δ2; 2. transition segment: short-time energy greater than δ2 but below δ1, and short-time zero-crossing rate greater than Z2; 3. music segment: short-time energy greater than δ1 and short-time zero-crossing rate greater than Z1; 4. ending segment: short-time energy below δ2 or short-time zero-crossing rate below Z2. In practice, noise must also be taken into account, so in addition to the four thresholds above, a shortest tone-segment length and a longest transition-segment length are set to distinguish noise and prevent premature truncation of a tone. The accuracy of the algorithm therefore depends mainly on the threshold settings, and the thresholds are usually derived from the background sound of the first few frames of the recording. This places requirements on the recording itself: if a small pop occurs at the beginning of the recording, the accuracy drops sharply, so the method lacks practicality.
Conventional pitch recognition research has focused on the time domain and the frequency domain. The short-time autocorrelation function measures the similarity of a signal with a delayed copy of itself and is commonly used to detect synchronism and periodicity. Because the autocorrelation necessarily reaches a maximum at integer multiples of the period, it provides an important basis for extracting the piano pitch, i.e. the fundamental frequency, with the short-time autocorrelation function. The traditional autocorrelation method extracts the fundamental frequency by drawing the short-time autocorrelation curve: the autocorrelation shows a peak at each pitch period, so the interval between two adjacent peaks is the pitch period. In general, however, the fundamental component is not the strongest component, and rich harmonic components make the waveform of the audio signal very complex, so frequency-doubling errors often occur, i.e. the estimated fundamental frequency is double or half the actual fundamental frequency. The wavelet analysis method, from applied mathematics, performs localized transforms of the signal in time and frequency and can effectively extract the fundamental-frequency information in a music signal. Concretely, a wavelet-component curve at a given level is drawn; the number n of sampling points between two maxima in the curve reflects the pitch period. The level is then changed repeatedly and the number of sampling points between adjacent maxima is recalculated at each level; if this number no longer changes, the fundamental frequency is determined.
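As a concrete illustration of the time-domain approach described above (not part of the patent), a minimal autocorrelation pitch estimator can be sketched in Python; the search band and the test tone are illustrative assumptions:

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=60.0, fmax=2000.0):
    """Estimate the fundamental frequency from the dominant autocorrelation peak.

    fmin/fmax bound the lag search band and are illustrative choices.
    """
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)            # lag range for the band
    lag = lo + np.argmax(r[lo:hi])                     # lag of the strongest peak
    return fs / lag

fs = 8000
x = np.sin(2 * np.pi * 220.0 * np.arange(2048) / fs)   # a pure 220 Hz test tone
pitch = autocorr_pitch(x, fs)
print(round(pitch))   # close to 220 Hz
```

On a pure tone this recovers the fundamental to within the lag quantization; on real piano audio with strong harmonics it exhibits exactly the doubling/halving errors the passage describes.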
Therefore, although the wavelet analysis method can effectively extract the fundamental frequency, the calculation amount is huge because wavelet components under different levels are calculated.
In summary, for endpoint detection the traditional double-threshold algorithm depends too heavily on the threshold settings. For fundamental-frequency extraction in pitch recognition, traditional time-domain methods are prone to frequency-doubling errors and have low accuracy, while traditional frequency-domain methods have high algorithmic complexity, a large computational load and low efficiency. Both classes of methods also demand a high signal-to-noise ratio and cannot accurately process audio signals with a low signal-to-noise ratio.
Disclosure of Invention
The invention aims to provide a piano note identification algorithm based on a convolutional neural network.
The technical solution for realizing the purpose of the invention is as follows: a piano note identification algorithm based on a convolutional neural network comprises the following steps:
step 1, finding out a starting point and an end point of each note from a continuous piano audio through an end point detection algorithm;
step 2, dividing the complete piano audio into a set of single note audio files according to the starting point and the ending point of each note;
step 3, drawing a spectrogram of each note;
and 4, inputting the spectrogram into the trained neural network to finish recognition.
Compared with the prior art, the invention has the following remarkable advantages: (1) compared with the traditional double-threshold algorithm, the short-time energy difference and double-threshold-based endpoint detection algorithm provided by the invention does not excessively depend on the setting of the threshold value, and has high accuracy; (2) compared with the traditional time-frequency domain method, the algorithm for identifying the piano pitch by using the convolutional neural network provided by the invention has the advantages of no frequency doubling error, strong noise resistance, simple algorithm, high operation speed and high accuracy.
Drawings
FIG. 1 is a flow chart of the piano note identification algorithm based on the convolutional neural network of the present invention.
FIG. 2 is a diagram of a neural network used in the present invention.
Fig. 3 is a short time energy plot.
Fig. 4 is a graph illustrating a short-time energy difference curve.
Fig. 5 is a diagram illustrating a short-time energy difference peak point.
FIG. 6 is a schematic diagram of short-term energy difference peak screening.
Detailed Description
As shown in FIG. 1, the piano note identification algorithm based on the convolutional neural network of the present invention comprises the following steps:
step 1, reading a section of audio signal, performing framing and windowing on the audio signal, and performing normalization pretreatment.
Framing and windowing represent the music signal, a non-stationary process, as a combination of frame sequences that are approximately stationary and time-invariant; this is the basis for the subsequent steps of computing the relevant features of the music signal.
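As an illustration of this preprocessing step, normalization, framing and Hanning windowing can be sketched as follows; the frame length and hop size are illustrative assumptions, since the patent fixes only the window type and the 44100 Hz sampling rate:

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Normalize, split into overlapping frames, and apply a Hanning window.

    frame_len and hop are illustrative values, not specified by the patent.
    """
    x = x / (np.max(np.abs(x)) + 1e-12)      # amplitude normalization
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hanning(frame_len)    # window every frame

# One second of a 440 Hz test tone at the 44100 Hz sampling rate used in the patent
x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
frames = frame_signal(x)
print(frames.shape)   # (85, 1024)
```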
Step 2, calculating and drawing the short-time energy difference curve of two adjacent frames. The short-time energy and short-time energy difference formulas are:

Ei = Σ Si(m)^2, summed over m = 1, ..., L

ΔEi = Ei − Ei-1

where Si(m) is the amplitude of the m-th point of the i-th frame and L is the frame length.
since short-time energy difference information between frames is calculated, Δ EiFiltering micro energy fluctuation in a part of original signals, smoothing energy change of the whole audio information, and adopting difference operation to calculate difference value delta E of two adjacent framesiThe note onset is easier to determine than the energy of the short duration of each frame.
And step 3, searching and marking all maximum value points (peak value points) in the curve as candidate note starting points.
At this point the set of peak points contains both peaks caused by background noise in the audio signal and the extreme points of the note signal, so it must be filtered.
And 4, setting the minimum peak height according to the background environment sound, and setting the shortest distance between adjacent peak points according to the playing speed.
The minimum peak height is mainly used to filter background noise; the shortest distance between adjacent peak points is mainly used to filter pseudo end points within a note, preventing one note from being cut multiple times, and needs to be adjusted according to the tempo of the performance.
And 5, screening the peak points marked in step 3 according to the minimum peak height and shortest peak distance set in step 4; the frames corresponding to the retained points are the starting points of the notes.
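Steps 3 to 5 amount to peak picking with a minimum height and a minimum inter-peak distance, which can be sketched with SciPy's `find_peaks`; the curve and threshold values below are illustrative, not the patent's:

```python
import numpy as np
from scipy.signal import find_peaks

# A toy short-time energy difference curve (one value per frame); the
# height and distance thresholds are illustrative, not the patent's values.
delta_E = np.array([0.0, 0.1, 2.0, 0.1, 0.05, 1.8, 0.1, 0.02, 0.3, 0.0])
peaks, _ = find_peaks(delta_E, height=0.5, distance=2)
print(list(peaks))   # [2, 5] -- frames retained as note starting points
```

Here `height` plays the role of the minimum peak height (rejecting the small noise peak at frame 8) and `distance` the role of the shortest inter-peak distance.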
Step 6, calculating the short-time zero-crossing rate of each frame, with the formula:

Zi = (1/2) Σ |sgn(Si(m)) − sgn(Si(m−1))| · w(m), summed over m = 1, ..., L − 1

where w(n) is a window function and sgn denotes the sign function, defined as:

sgn(x) = 1 for x ≥ 0; sgn(x) = −1 for x < 0.
the short-time zero-crossing rate measurement has the significance that the periodic change of the signal can be reflected to a certain extent. For sampled sinusoidal periodic signals, the average zero crossing rate must be twice the signal frequency multiplied by the sampling period, and when the sampling period is fixed, the zero crossing rate reflects the signal frequency information. Especially for regular musical tone signals, the zero-crossing rate is distributed in a certain range, and the rule can be used for distinguishing musical tones from noise because the zero-crossing rate of the noise is larger.
And 7, setting two thresholds of short-time energy and short-time zero-crossing rate, and respectively calculating corresponding end points of each starting point obtained in the step 5.
And 8, judging the position of the end point corresponding to each start point: if an end point falls after the next start point, take the frame 10 frames before that next start point as the corresponding end point.
And 9, calculating the difference value of each pair of start and stop points, judging the difference value as noise if the difference value is smaller than the set shortest note length, deleting the pair of start and stop points from the set, and finally obtaining the start and stop points of all notes.
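The screening of step 9 reduces to a simple filter on start/end pairs; the `min_len` threshold and the toy pairs below are illustrative:

```python
def filter_short_notes(pairs, min_len):
    """Keep only start/end pairs at least min_len frames long.

    min_len is the shortest-note-length threshold; the value is illustrative.
    """
    return [(s, e) for (s, e) in pairs if e - s >= min_len]

pairs = [(10, 60), (65, 68), (70, 130)]        # (65, 68) is a 3-frame burst: noise
print(filter_short_notes(pairs, min_len=10))   # [(10, 60), (70, 130)]
```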
Because steps 8 and 9 re-judge and re-screen each start and end point, the algorithm's dependence on threshold settings is greatly reduced and the accuracy is improved.
And step 10, dividing the continuous notes in the audio into single notes according to the start and stop point information obtained in the step 9.
And step 11, drawing a spectrogram of each note.
And step 12, inputting the spectrogram into a trained neural network to obtain the pitch. The neural network structure is shown in fig. 2. All convolution kernels in the network are 3 x 3 in size, the pooling layers are in maximum pooling, the number of neurons in the fully-connected layer 1 is 1024, the number of neurons in the fully-connected layer 2 is 88, and the size corresponds to 88 pitches of the piano.
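The patent specifies 3 x 3 kernels, max pooling, and fully connected layers of 1024 and 88 units, but not the network depth or input resolution. Under the assumption of two conv/pool stages and a 128 x 128 input (both illustrative, not from the patent), the spatial sizes work out as:

```python
def conv_out(size, k=3, stride=1, pad=0):
    # spatial size after a k x k convolution
    return (size - k + 2 * pad) // stride + 1

def pool_out(size, k=2):
    # spatial size after k x k max pooling
    return size // k

h = w = 128                      # assumed input resolution (not stated in the patent)
for _ in range(2):               # assumed two conv/pool stages (not stated either)
    h, w = conv_out(h), conv_out(w)
    h, w = pool_out(h), pool_out(w)
print(h, w)   # 30 30 -> flattened, then fc1 (1024 units), then fc2 (88 pitches)
```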
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Examples
The audio file used in the embodiment is a manually recorded piano performance containing 8 notes in total.
Step 1, after the recording file is obtained, framing and windowing are performed on it. The sampling rate is 44100 Hz, and the window function is the commonly used Hanning window, defined as:

w(n) = 0.5 × (1 − cos(2πn / (N − 1))), 0 ≤ n ≤ N − 1

where N is the window length.
step 2, after the framing operation is finished, according to a formula:
ΔEi=Ei-Ei-1
respectively calculating the short-time energy and the short-time zero crossing rate of each frame and the short-time energy difference of two adjacent frames, storing the results in an array and drawing a curve, wherein the short-time energy is shown in figure 3, and the short-time energy difference is shown in figure 4.
And 3, after the short-time energy difference curve is obtained, all peaks in the curve, i.e. the maximum points, are found and marked with red asterisks, and the peak positions and values are stored in an array for later use. As shown in FIG. 5, the peaks caused by background noise are generally small and differ markedly from the peaks near note onsets. Several peaks are detected within the duration of one note: the highest is the true note start, while the adjacent smaller peaks are pseudo end points; the peak of a pseudo end point is slightly lower than that of the true start, and its distance to the true start is small.
And 4, setting the minimum peak height and the shortest peak distance. The minimum peak height is an empirical value that only needs to separate piano tones from environmental background sound; the shortest peak distance is related to the tempo used when playing. All marked peak points are then screened against these two thresholds: points below the minimum peak height, i.e. noise, are deleted from the array together with their peak values; among points above the minimum peak height whose distance to an adjacent peak is below the shortest peak distance, the point with the larger peak value is kept and the other, a pseudo end point, is deleted. The final screening result is shown in fig. 6.
And 5, after the candidate start points of all notes are obtained, starting from each start point, judge frame by frame whether the short-time energy and the short-time zero-crossing rate simultaneously satisfy the threshold conditions. If one frame satisfies the end-point conditions, continue to check whether the next 9 frames also satisfy them; if so, that frame is a candidate end point, otherwise continue until one is found. After a candidate end point is obtained, check whether it lies before the next start point; if so, set it as the end point corresponding to the current start point; otherwise the search has failed, and the end point corresponding to the current start point is set to 5 frames before the next start point.
Repeat this step until the end points corresponding to all start points have been found, then store the start and end points in an array in one-to-one correspondence as the candidate note endpoints.
And 6, calculating the difference of each pair of start and end points and checking whether it is greater than the shortest note length; if so, the pair is kept, otherwise it is judged to be noise and deleted from the candidate note endpoints. This completes the endpoint detection.
And 7, after the start and end points of each note are obtained, for each pair read the segment of the original recording between the start point and its corresponding end point from step 6 and extract it into an independent audio file. Repeating this step completes the audio segmentation and yields one audio file per note, 8 in total, named 1 to 8 in note order.
And 8, drawing a spectrogram of all the note audio files obtained in the step 7, wherein the abscissa of the spectrogram represents time, the ordinate represents frequency, the color represents energy, and the picture name is consistent with the audio name.
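Computing the underlying time-frequency data of such a spectrogram can be sketched with SciPy; the FFT length and the test tone are illustrative assumptions:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 44100                                  # sampling rate used in the embodiment
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 880.0 * t)           # a pure tone near the pitch A5

# f: frequency bins (ordinate), ts: time bins (abscissa), S: energy (color)
f, ts, S = spectrogram(x, fs=fs, nperseg=1024)
peak_bin = f[np.argmax(S.mean(axis=1))]
print(round(peak_bin))   # strongest frequency bin, within one bin of 880 Hz
```

Plotting `S` (e.g. on a log scale) against `ts` and `f` produces the image that is fed to the network.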
And 9, inputting the spectrograms obtained in step 8 into the neural network, which automatically scales each image to the required input size; the network's computation finally yields the pitch corresponding to each note, output as a pitch name. The example audio contains 8 piano notes, and the final recognition results are A5, G5, E5, G5, C6, A5, G5 and A5, consistent with the pitches actually played and all correct.
Claims (6)
1. A piano note identification algorithm based on a convolutional neural network is characterized by comprising the following steps:
step 1, finding out a starting point and an end point of each note from a continuous piano audio through an end point detection algorithm;
step 2, dividing the complete piano audio into a set of single note audio files according to the starting point and the ending point of each note;
step 3, drawing a spectrogram of each note;
and 4, inputting the spectrogram into the trained neural network to finish recognition.
2. The convolutional neural network-based piano note identification algorithm as claimed in claim 1, wherein step 1 finds the start point of each note using abrupt-change information of the energy in the time domain, and calculates the end point of each note using the double-threshold algorithm combined with the start-point information;
the short-time energy formula is:

Ei = Σ Si(m)^2, summed over m = 1, ..., L

where Si(m) is the amplitude of the m-th point of the i-th frame audio signal and L is the frame length;
the short-time energy difference is the energy difference ΔEi between two adjacent frames, namely:

ΔEi = Ei − Ei-1
3. the convolutional neural network-based piano note identification algorithm as claimed in claim 2, wherein the end point detection algorithm based on short-time energy difference comprises the following steps:
A) calculating and drawing a short-time energy difference curve of two adjacent frames;
B) searching and marking all maximum value points in the curve as candidate note starting points;
C) setting a minimum peak height according to background environment sounds, and setting a shortest distance between adjacent peak points according to playing speed;
D) screening the peak points in the step B according to the minimum peak height and the minimum peak distance set in the step C, wherein the frame corresponding to the reserved points is the starting point of each note;
E) calculating the short-time zero-crossing rate of each frame, with the formula:

Zi = (1/2) Σ |sgn(Si(m)) − sgn(Si(m−1))| · w(m), summed over m = 1, ..., L − 1

where w(n) is a window function and sgn denotes the sign function, defined as:

sgn(x) = 1 for x ≥ 0; sgn(x) = −1 for x < 0;
F) setting two thresholds for the short-time energy and the short-time zero-crossing rate, and respectively calculating the end point corresponding to each start point obtained in step D;
G) judging the position of the end point corresponding to each start point: if an end point falls after the next start point, take the frame 10 frames before that next start point as the corresponding end point.
4. The convolutional neural network-based piano note identification algorithm as claimed in claim 3, wherein the read-in audio signal is subjected to frame windowing and normalization before endpoint detection.
5. The convolutional neural network-based piano note identification algorithm as claimed in claim 3, wherein the difference between each pair of start and stop points is calculated, if the difference is smaller than the set shortest note length, it is determined as noise, the pair of start and stop points is deleted from the set, and finally the start and stop points of all notes are obtained.
6. The convolutional neural network-based piano note identification algorithm as claimed in claim 1, wherein step 4 inputs the spectrogram into the trained neural network to obtain the pitch; all convolution kernel sizes in the neural network are 3 x 3, the pooling layer is maximum pooling, the number of neurons of the full connection layer 1 is 1024, the number of neurons of the full connection layer 2 is 88, and the size corresponds to 88 pitches of the piano.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910787062.4A CN110599987A (en) | 2019-08-25 | 2019-08-25 | Piano note recognition algorithm based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910787062.4A CN110599987A (en) | 2019-08-25 | 2019-08-25 | Piano note recognition algorithm based on convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110599987A true CN110599987A (en) | 2019-12-20 |
Family
ID=68855426
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910787062.4A Pending CN110599987A (en) | 2019-08-25 | 2019-08-25 | Piano note recognition algorithm based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110599987A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111415681A (en) * | 2020-03-17 | 2020-07-14 | 北京奇艺世纪科技有限公司 | Method and device for determining musical notes based on audio data |
CN111508526A (en) * | 2020-04-10 | 2020-08-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for detecting audio beat information and storage medium |
CN111508480A (en) * | 2020-04-20 | 2020-08-07 | 网易(杭州)网络有限公司 | Training method of audio recognition model, audio recognition method, device and equipment |
CN111540378A (en) * | 2020-04-13 | 2020-08-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio detection method, device and storage medium |
CN112259063A (en) * | 2020-09-08 | 2021-01-22 | 华南理工大学 | Multi-tone overestimation method based on note transient dictionary and steady dictionary |
CN112420071A (en) * | 2020-11-09 | 2021-02-26 | 上海交通大学 | Constant Q transformation based polyphonic electronic organ music note identification method |
CN112509601A (en) * | 2020-11-18 | 2021-03-16 | 中电海康集团有限公司 | Note starting point detection method and system |
CN113593504A (en) * | 2020-04-30 | 2021-11-02 | 小叶子(北京)科技有限公司 | Pitch recognition model establishing method, pitch recognition method and pitch recognition device |
CN113658612A (en) * | 2021-08-25 | 2021-11-16 | 桂林智神信息技术股份有限公司 | Method and system for identifying played keys based on audio |
CN114283841A (en) * | 2021-12-20 | 2022-04-05 | 天翼爱音乐文化科技有限公司 | Audio classification method, system, device and storage medium |
CN116884438A (en) * | 2023-09-08 | 2023-10-13 | 杭州育恩科技有限公司 | Method and system for detecting musical instrument training sound level based on acoustic characteristics |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1093310A2 (en) * | 1999-09-28 | 2001-04-18 | Nortel Networks Limited | Tone detection using neural network |
US20060095254A1 (en) * | 2004-10-29 | 2006-05-04 | Walker John Q Ii | Methods, systems and computer program products for detecting musical notes in an audio signal |
US20080188967A1 (en) * | 2007-02-01 | 2008-08-07 | Princeton Music Labs, Llc | Music Transcription |
US20090151544A1 (en) * | 2007-12-17 | 2009-06-18 | Sony Corporation | Method for music structure analysis |
CN103325382A (en) * | 2013-06-07 | 2013-09-25 | 大连民族学院 | Method for automatically identifying Chinese national minority traditional instrument audio data |
CN104021789A (en) * | 2014-06-25 | 2014-09-03 | 厦门大学 | Self-adaption endpoint detection method using short-time time-frequency value |
CN104143324A (en) * | 2014-07-14 | 2014-11-12 | 电子科技大学 | Musical tone note identification method |
CN104217731A (en) * | 2014-08-28 | 2014-12-17 | 东南大学 | Quick solo music score recognizing method |
CN105976803A (en) * | 2016-04-25 | 2016-09-28 | 南京理工大学 | Note segmentation method based on music score |
CN108038146A (en) * | 2017-11-29 | 2018-05-15 | 无锡同芯微纳科技有限公司 | Musical performance artificial intelligence analysis method, system and equipment |
CN110136730A (en) * | 2019-04-08 | 2019-08-16 | 华南理工大学 | A kind of automatic allocation system of piano harmony and method based on deep learning |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1093310A2 (en) * | 1999-09-28 | 2001-04-18 | Nortel Networks Limited | Tone detection using neural network |
US20060095254A1 (en) * | 2004-10-29 | 2006-05-04 | Walker John Q Ii | Methods, systems and computer program products for detecting musical notes in an audio signal |
US20080188967A1 (en) * | 2007-02-01 | 2008-08-07 | Princeton Music Labs, Llc | Music Transcription |
CN101652807A (en) * | 2007-02-01 | 2010-02-17 | 缪斯亚米有限公司 | Music transcription |
US20090151544A1 (en) * | 2007-12-17 | 2009-06-18 | Sony Corporation | Method for music structure analysis |
CN103325382A (en) * | 2013-06-07 | 2013-09-25 | 大连民族学院 | Method for automatically identifying Chinese national minority traditional instrument audio data |
CN104021789A (en) * | 2014-06-25 | 2014-09-03 | 厦门大学 | Adaptive endpoint detection method using short-time time-frequency values |
CN104143324A (en) * | 2014-07-14 | 2014-11-12 | 电子科技大学 | Musical tone note identification method |
CN104217731A (en) * | 2014-08-28 | 2014-12-17 | 东南大学 | Fast solo music score recognition method |
CN105976803A (en) * | 2016-04-25 | 2016-09-28 | 南京理工大学 | Note segmentation method based on music score |
CN108038146A (en) * | 2017-11-29 | 2018-05-15 | 无锡同芯微纳科技有限公司 | Musical performance artificial intelligence analysis method, system and equipment |
CN110136730A (en) * | 2019-04-08 | 2019-08-16 | 华南理工大学 | Automatic piano harmony arrangement system and method based on deep learning |
Non-Patent Citations (2)
Title |
---|
Wu Yang (伍洋): "Fundamental Frequency Recognition of Musical Tones Based on MFCC and BP Neural Network", Information Science and Technology Collection (《信息科技辑》) * |
Li Siquan et al. (黎思泉等): "A Piano Note Endpoint Detection Algorithm Fusing Time-Frequency Information", Science Technology and Innovation (《科技与创新》) * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111415681A (en) * | 2020-03-17 | 2020-07-14 | 北京奇艺世纪科技有限公司 | Method and device for determining musical notes based on audio data |
CN111415681B (en) * | 2020-03-17 | 2023-09-01 | 北京奇艺世纪科技有限公司 | Method and device for determining notes based on audio data |
CN111508526B (en) * | 2020-04-10 | 2022-07-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for detecting audio beat information and storage medium |
CN111508526A (en) * | 2020-04-10 | 2020-08-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for detecting audio beat information and storage medium |
CN111540378A (en) * | 2020-04-13 | 2020-08-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio detection method, device and storage medium |
CN111508480A (en) * | 2020-04-20 | 2020-08-07 | 网易(杭州)网络有限公司 | Training method of audio recognition model, audio recognition method, device and equipment |
CN113593504A (en) * | 2020-04-30 | 2021-11-02 | 小叶子(北京)科技有限公司 | Pitch recognition model establishing method, pitch recognition method and pitch recognition device |
CN112259063A (en) * | 2020-09-08 | 2021-01-22 | 华南理工大学 | Multi-pitch estimation method based on note transient dictionary and steady-state dictionary |
CN112420071A (en) * | 2020-11-09 | 2021-02-26 | 上海交通大学 | Constant Q transformation based polyphonic electronic organ music note identification method |
CN112509601A (en) * | 2020-11-18 | 2021-03-16 | 中电海康集团有限公司 | Note starting point detection method and system |
CN113658612A (en) * | 2021-08-25 | 2021-11-16 | 桂林智神信息技术股份有限公司 | Method and system for identifying played keys based on audio |
CN113658612B (en) * | 2021-08-25 | 2024-02-09 | 桂林智神信息技术股份有限公司 | Method and system for identifying played keys based on audio frequency |
CN114283841A (en) * | 2021-12-20 | 2022-04-05 | 天翼爱音乐文化科技有限公司 | Audio classification method, system, device and storage medium |
CN116884438A (en) * | 2023-09-08 | 2023-10-13 | 杭州育恩科技有限公司 | Method and system for detecting musical instrument training sound level based on acoustic characteristics |
CN116884438B (en) * | 2023-09-08 | 2023-12-01 | 杭州育恩科技有限公司 | Method and system for detecting musical instrument training sound level based on acoustic characteristics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110599987A (en) | Piano note recognition algorithm based on convolutional neural network | |
Gillet et al. | Transcription and separation of drum signals from polyphonic music | |
US8193436B2 (en) | Segmenting a humming signal into musical notes | |
Ryynänen et al. | Automatic transcription of melody, bass line, and chords in polyphonic music | |
JP5282548B2 (en) | Information processing apparatus, sound material extraction method, and program | |
Kroher et al. | Automatic transcription of flamenco singing from polyphonic music recordings | |
CN111369982A (en) | Training method of audio classification model, audio classification method, device and equipment | |
CN109979488B (en) | System for converting human voice into music score based on stress analysis | |
JP2009511954A (en) | Neural network discriminator for separating audio sources from mono audio signals | |
CN110136730B (en) | Deep learning-based automatic piano harmony arrangement system and method |
CN106997765B (en) | Quantitative characterization method for human voice timbre | |
CN110516102B (en) | Lyric time stamp generation method based on spectrogram recognition | |
Kirchhoff et al. | Evaluation of features for audio-to-audio alignment | |
Azarloo et al. | Automatic musical instrument recognition using K-NN and MLP neural networks | |
CN113192471B (en) | Musical main melody track recognition method based on neural network | |
CN112420071B (en) | Constant Q transformation based polyphonic electronic organ music note identification method | |
Arumugam et al. | An efficient approach for segmentation, feature extraction and classification of audio signals | |
CN105895079A (en) | Voice data processing method and device | |
TWI299855B (en) | Detection method for voice activity endpoint | |
Gao et al. | Vocal melody extraction via dnn-based pitch estimation and salience-based pitch refinement | |
Oudre et al. | Chord recognition using measures of fit, chord templates and filtering methods | |
Gurunath Reddy et al. | Predominant melody extraction from vocal polyphonic music signal by time-domain adaptive filtering-based method | |
CN112634841B (en) | Guitar music automatic generation method based on voice recognition | |
CN111681674B (en) | Musical instrument type identification method and system based on naive Bayesian model | |
CN114678039A (en) | Singing evaluation method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191220 ||