CN116682456A - Musical instrument classification method based on time-frequency fine analysis - Google Patents
- Publication number: CN116682456A (application CN202310583602.3A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
- G06F18/2131 — Feature extraction based on transform-domain processing, e.g. wavelet transform
- G06F18/2148 — Generating training patterns; bootstrap methods (e.g. bagging or boosting), characterised by process organisation or structure
- G06F18/24323 — Tree-organised classifiers
- G06F18/254 — Fusion techniques of classification results, e.g. of results related to the same input data
- G06F18/259 — Fusion by voting
- G10L25/06 — Extracted parameters being correlation coefficients
- G10L25/18 — Extracted parameters being spectral information of each sub-band
- G10L25/24 — Extracted parameters being the cepstrum
- G10L25/27 — Characterised by the analysis technique
- G10L25/45 — Characterised by the type of analysis window
- G06F2123/02 — Data types in the time domain, e.g. time-series data
- G06F2218/08 — Feature extraction (pattern recognition adapted for signal processing)
- G06F2218/12 — Classification; Matching
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a musical instrument classification method based on fine time-frequency analysis, comprising the following steps: fundamental frequency estimation, frequency-domain harmonic marking, time-domain single-period fundamental frequency sequence estimation, time-frequency fine feature extraction, and random forest modelling. The invention designs a cyclic-iteration fundamental frequency estimation algorithm that combines autocorrelation with narrowband spectral energy frequency estimation; harmonic positions are then accurately marked, and a single-period fundamental frequency sequence is obtained, by a harmonic marking algorithm with adaptive-window search and a time-domain single-period fundamental frequency sequence estimation algorithm. The fine spectral structure and the single-period fundamental frequency variation are characterized by the time-frequency fine features and extracted. Finally, the time-frequency fine features are fused with 13-dimensional mel-frequency cepstral coefficient (MFCC) features and input into a random forest model for training; the resulting instrument classification error rate is 6% lower than that obtained with MFCC features alone.
Description
Technical Field
The invention relates to the technical field of signal processing, in particular to a musical instrument classification method based on time-frequency fine analysis.
Background
Analyzing an audio file yields the instrument characteristics of the audio, and instrument recognition can quickly and accurately identify the instrument types contained in the file, which benefits applications such as intelligent music classification, instrument teaching, and music composition.
In past research, the time-frequency features proposed for instrument recognition have achieved a certain classification effect, but these features tend to operate at a coarse scale, describing the whole time domain or the whole frequency domain. For example, the amplitude envelope describes the overall variation of audio amplitude over a period of time, but it cannot precisely describe the fine variation between individual time-domain fundamental periods. Likewise, the spectral centroid describes the center of gravity of frequency-domain energy only at the macroscopic scale of the whole spectrogram; it cannot finely describe the magnitude relations between harmonics of different orders, or between a given harmonic and nearby non-harmonic components. In other words, the spectral centroid has no way to convey precise information about the spectral structure, yet the spectral structure often plays a decisive role in an instrument's timbre. The present invention therefore holds that much past feature research has neglected fine characterization of the time-frequency properties of music, a gap that needs to be filled.
Fine time-frequency analysis rests on accurate estimation of the fundamental frequency; however, accurate fundamental estimation and accurate harmonic marking are often complex processes that combine nontrivial theory and must handle the anomalies of real audio data, such as spectrum aliasing, noise superposition, and spectrum offset. These place extremely high demands on the adaptability and robustness of the fundamental frequency estimation and harmonic labeling algorithms. It is therefore challenging to implement, in code, fundamental frequency estimation and harmonic labeling accurate enough to accommodate the various anomalies of real data, and thereby lay a platform foundation for subsequent research.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a musical instrument classification method based on time-frequency fine analysis.
The aim of the invention can be achieved by adopting the following technical scheme:
a musical instrument classification method based on time-frequency fine analysis, the musical instrument classification method comprising the steps of:
s1, inputting single-tone audio of a musical instrument, and accurately estimating fundamental frequency of the audio;
s2, carrying out FFT frequency domain harmonic marking according to the size of the fundamental frequency of the audio;
s3, estimating a time domain single-period fundamental frequency sequence according to the fundamental frequency of the audio;
s4, extracting time-frequency fine features and Mel frequency cepstrum coefficient features;
and S5, based on the extracted time-frequency fine features and the mel-frequency cepstral coefficient features, evaluating feature correlation and coefficients of variation, and establishing a random forest model to classify musical instruments.
Further, the step S1 is as follows:
s101, detecting a single-tone audio end point of a musical instrument by adopting a short-time energy method, intercepting effective audio, and calculating the short-time energy by the following formula:
In s [ n ]]Is the input musical instrument single tone audio signal, w [ n ]]Is a window function, n is s [ n ]]And w [ n ]]Time domain index, N f Is the frame length of the signal frame, z is the amount of shift of the window function, E z Is short-time energy corresponding to the translation amount z; e corresponding to different translation amounts z Forming a short-time energy sequence; setting a threshold value for short-time energy of a starting and stopping endpoint of the effective audio of the single-tone audio of the musical instrument, recording the midpoint of the current frame as the starting endpoint of the effective audio when the short-time energy of the current frame exceeds the threshold value of the starting endpoint, and recording the midpoint of the current frame as the ending endpoint of the effective audio when the short-time energy of the current frame is lower than the threshold value of the ending endpoint;
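The endpoint detection of S101 can be sketched in Python as follows; the Hamming window, frame length, hop size, and threshold values are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def short_time_energy(s, frame_len, hop):
    """Formula (1): E_z = sum((s[n] * w[n-z])^2) over each frame,
    using a Hamming window w of length frame_len (an assumed window)."""
    w = np.hamming(frame_len)
    energies = []
    for z in range(0, len(s) - frame_len + 1, hop):
        frame = s[z:z + frame_len] * w
        energies.append(np.sum(frame ** 2))
    return np.array(energies)

def trim_by_energy(s, frame_len, hop, start_thr, end_thr):
    """Return (start, end) sample indices of the effective audio:
    start = midpoint of the first frame whose energy exceeds start_thr,
    end   = midpoint of the first later frame that falls below end_thr."""
    E = short_time_energy(s, frame_len, hop)
    start = end = None
    for i, e in enumerate(E):
        mid = i * hop + frame_len // 2
        if start is None and e > start_thr:
            start = mid
        elif start is not None and e < end_thr:
            end = mid
            break
    if end is None:
        end = len(s) - 1
    return start, end
```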
s102, roughly estimating effective audio x [ n ] through time domain autocorrelation]Fundamental frequency f of (f) 0 The discrete signal autocorrelation formula is as follows:
where τ represents the hysteresis, i.e., the amount of translation, x [ n ]]Shifting tau units to the left, multiplying the shifted tau units by the original signal, and finally accumulating and summing the multiplied results to obtain an autocorrelation value under the current hysteresis value; the autocorrelation value is maximum when the hysteresis value is zero; finding out maximum value point except zero according to the relation between the autocorrelation value and different hysteresis values, wherein the difference between the maximum value point and the hysteresis value of zero is the approximate estimation of fundamental wave period, thereby estimating fundamental frequency f 0 ;
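A minimal sketch of the coarse autocorrelation estimate of S102; the plausible pitch range used to bound the lag search is an assumption of this sketch:

```python
import numpy as np

def autocorr_f0(x, fs, f_min=50.0, f_max=2000.0):
    """Coarse fundamental estimate per Formula (2):
    r[tau] = sum(x[n] * x[n+tau]); the lag of the largest
    autocorrelation peak away from tau=0 approximates one period."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:]  # r[0..n-1], lag >= 0
    lo = int(fs / f_max)                  # smallest plausible period (samples)
    hi = min(int(fs / f_min), n - 1)      # largest plausible period
    tau = lo + int(np.argmax(r[lo:hi]))   # best lag away from zero
    return fs / tau
```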
S103, performing a fast Fourier transform (FFT) on the audio signal and recording the frequency f_max corresponding to the FFT maximum. Whether f_max is an interfering component is judged against the time-domain autocorrelation result; if it is, the component is removed from the FFT spectrum and f_max is re-acquired from the new FFT maximum. This is repeated until the autocorrelation result indicates that f_max is not an interfering component;
s104, adopting narrow-band spectrum energy frequencyCalculating the denoising frequency value f of the maximum peak corresponding frequency value of the fast Fourier transform by the rate estimation c Eliminating noise interference; calculating a denoised frequency value f of the FFT spectrum using the narrowband spectral energy frequency estimate with the index of the FFT spectrum array starting at 1 c The formula of (2) is as follows:
wherein k is 0 Is the FFT index corresponding to the maximum peak of the FFT, k is the index of the FFT spectrum, |X k I is the FFT module value corresponding to the kth point in the FFT spectrum, F s Is the sampling rate of the audio signal, N is the FFT point number;
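Formula (3) can be sketched as an energy-weighted frequency centroid over a narrow band around the peak bin; the band half-width (here ±3 bins) is an assumption, since the source does not specify the width of the narrow band:

```python
import numpy as np

def narrowband_freq_estimate(X, fs, k0, half_width=3):
    """Energy-weighted frequency centroid over a narrow band around the
    largest FFT peak (1-based index k0), per Formula (3).
    half_width is an assumed band size."""
    N = len(X)
    ks = np.arange(max(1, k0 - half_width), min(N, k0 + half_width) + 1)
    w = np.abs(X[ks - 1]) ** 2        # spectral energy |X_k|^2
    f = (ks - 1) * fs / N             # bin frequencies, array indexed from 1
    return float(np.sum(w * f) / np.sum(w))
```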
s105, according to the denoising frequency value f c And the fundamental frequency f of the autocorrelation estimation 0 Calculate that the maximum peak is located at about n m Subharmonic; wherein:
obtaining the fundamental frequency estimated value as f 1 :
S106, resampling the audio time-domain signal according to the fundamental frequency estimate f_1, with the new sampling rate f_s calculated as:

f_s = f_1 × 2^m   (Formula 6)

where m is chosen (Formula 7) so that the new sampling rate is not so large that the N sampling points fail to cover enough fundamental periods. The resampled time-domain signal is then processed again from step S102, until the change in the fundamental frequency estimate f_1 relative to the previous iteration is smaller than 0.001, or the number of iterations reaches the limit of 50; the final fundamental frequency estimate f_1 is then output.
Further, in the step S2, the FFT frequency domain harmonic labeling process according to the fundamental frequency size of the audio frequency is as follows:
s201, detecting an audio endpoint of the musical instrument by adopting a short-time energy method, and intercepting an effective audio part;
s202, estimating the value f of the fundamental frequency according to the step S1 1 Resampling the audio time domain signal to make the sampling rate of the audio be f 1 To the power of (a), reduces spectrum leakage and converts the FFT spectrum to a logarithmic spectrum;
s203, determining a left endpoint l and a right endpoint r of the next harmonic peak search range on the FFT spectrum, and if the next harmonic peak search range is the first 3 times harmonic, the calculation formula is as follows:
where P is the index of the position of the last harmonic peak in the FFT spectrum, if the position of the first harmonic peak is marked, p=1;
For the 4th and later harmonics, the left endpoint l and right endpoint r are calculated as:

l = P + bT − 2   (Formula 10)
r = P + 1.09 · T_1   (Formula 11)

where bT is half the distance on the FFT spectrum between the third harmonic peak and the first harmonic peak, and T_1 is the FFT-point spacing of the two marked harmonics closest to the current one;
s204, all maximum points are obtained in the search range, and the maximum point of all the maximum points is obtained as the obtained peak point of the current harmonic; if no maximum point exists in the search range, the left and right boundaries of the search range are expanded when the first fourth harmonic is marked, and only the right boundary is expanded when other harmonics are marked; when expanding the search boundary of the first two harmonics, the left and right endpoints l and r satisfy the following conditions:
when expanding the search boundaries of the 3 rd and 4 th harmonics, the left and right endpoints l and r satisfy the following conditions:
l-P≥0.7T 1 formula (14)
r-P≤1.3T 1 Formula (15)
When the search boundary of the 5 th order and above harmonics is extended, the right end point r satisfies the following condition:
r-P≤1.3T 1 formula (16)
The largest local maximum within the new search range is then taken as the peak point of the current harmonic. All FFT log-magnitude values, and the index values, of the local maxima lying within an interval of 0.1 harmonic spacings centered on that maximum are recorded and stored in a matrix HWB and an array HWI respectively; each row of HWB stores the log-magnitude of one harmonic peak together with the nearby spectral log-magnitudes. If still no local maximum is found, the search range continues to be expanded and the above procedure is repeated.
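A simplified sketch of the search-and-expand procedure of S203–S204 for the 4th and later harmonics; approximating bT by one harmonic spacing T1, and widening only the right boundary one bin at a time, are assumptions of this sketch:

```python
import numpy as np

def local_maxima(mag, lo, hi):
    """Indices of local maxima of mag within [lo, hi] (inclusive)."""
    idx = []
    for k in range(max(lo, 1), min(hi, len(mag) - 2) + 1):
        if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            idx.append(k)
    return idx

def next_harmonic_peak(mag, P, T1):
    """Search window per Formulas (10)-(11): l = P + bT - 2 with bT
    approximated by T1 (assumption), r = P + 1.09*T1.  The peak is the
    largest local maximum in the window; if none is found, the right
    boundary is widened until r - P <= 1.3*T1 (Formula 16) is exhausted."""
    l = P + T1 - 2
    r = int(P + 1.09 * T1)
    while True:
        cand = local_maxima(mag, l, r)
        if cand:
            return max(cand, key=lambda k: mag[k])
        if r - P >= int(1.3 * T1):   # give up once Formula (16) is exhausted
            return None
        r += 1
```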
Further, in the step S3, the process of estimating the time-domain single-period fundamental frequency sequence according to the fundamental frequency of the audio is as follows:
s301, detecting an audio endpoint of a musical instrument by adopting a short-time energy method, and intercepting an effective audio part;
s302, filtering low-frequency interference by using a high-pass filter;
s303, marking a demarcation point of a single-base wave period by using a cross-correlation operation, and taking a certain point o in the middle of an audio signal as an origin, extending towards two sides at equal intervals, wherein the extending distance is equal to the length of the single-base wave period obtained by the last operation, the signal in the extending range is a shift vector, and the length of the shift vector is twice the length of the single-base wave period obtained by the last operation, so that the self-adaptive adjustment is characterized; if the first cross-correlation operation is performed, the single period length refers to the fundamental frequency estimation result in the step S2; taking the o point as an origin, and taking signals with four single-base wave period lengths to the left or right as fixed vectors, wherein the single-base wave period length is also the single-base wave period length obtained by the last operation; performing cross-correlation operation by taking the shift vector as a sliding window and a fixed vector to obtain a corresponding relation between a cross-correlation value and relative displacement; recording extreme point sequences and corresponding relative displacement sequences of the cross-correlation values from the corresponding relations; selecting a reasonable value from the relative displacement sequence, and calculating a next single-base wave period separation point by taking the o point as a reference, wherein the distance between the separation point and the o is the single-base wave period length; and circularly reciprocating to obtain a time domain single period fundamental frequency sequence, wherein the calculation formula of the cross-correlation r [ tau ] is as follows:
In the formula, τ represents hysteresis value, i.e. translation quantity, g [ n ] and v [ n ] are two different signals, v [ n ] is translated leftwards by a distance of τ units and then multiplied by g [ n ], and the multiplied results are accumulated and summed, i.e. similarity under the current hysteresis value.
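One step of the period-boundary marking in S303 might look as follows; choosing the cross-correlation peak whose displacement lies near one previous period length is this sketch's reading of "selecting a reasonable value from the relative displacement sequence":

```python
import numpy as np

def next_period_length(x, o, T_prev):
    """One step of S303: the shift vector is x[o-T_prev : o+T_prev]
    (twice the previous period length), the fixed vector spans four
    period lengths to the right of o; the displacement of the best
    cross-correlation peak near T_prev gives the next period length."""
    shift = x[o - T_prev: o + T_prev]
    fixed = x[o: o + 4 * T_prev]
    r = np.correlate(fixed, shift, mode="valid")  # Formula (17) per shift
    # search displacements in an assumed plausible window around T_prev
    lo = max(0, T_prev // 2)
    hi = min(len(r), 2 * T_prev)
    return lo + int(np.argmax(r[lo:hi]))
```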
Further, in the step S4, the process of extracting the time-frequency fine feature and mel-frequency cepstrum coefficient feature is as follows:
s401, extracting frequency domain fine features, including: the logarithmic value is larger than zero harmonic energy centroid, harmonic peak burr average attenuation ratio, logarithmic value is larger than zero maximum harmonic frequency, slope bandwidth ratio between two harmonics, harmonic peak-valley ratio, harmonic energy drop rate, odd harmonic to even harmonic energy ratio of the first 6 harmonics, first harmonic peak-valley ratio, slope bandwidth ratio drop rate between harmonics, frequency domain logarithmic value is larger than ratio of zero harmonic energy to total energy, harmonic peak-valley difference ratio half harmonic interval, first harmonic peak-valley difference ratio half harmonic interval;
From the matrix HWB obtained in step S2, the log-magnitude sequence P[n_h] of the maximum point of each harmonic peak is obtained; entries less than or equal to zero are set to zero, and indexing starts from 1. The harmonic energy centroid c over harmonics with log-magnitude greater than zero is calculated as:

c = [ Σ_{n_h=1}^{L_P} n_h · P[n_h] ] / [ Σ_{n_h=1}^{L_P} P[n_h] ]   (Formula 18)

where L_P is the length of P[n_h], which is also the maximum harmonic order, and n_h is the current harmonic order, also the index into P[n_h];
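Formula (18) is a one-line weighted mean; a sketch with the clipping of non-positive log-magnitudes applied first:

```python
import numpy as np

def harmonic_energy_centroid(P):
    """Formula (18): c = sum(n_h * P[n_h]) / sum(P[n_h]) over the harmonic
    log-magnitudes, after setting values <= 0 to zero (n_h counted from 1)."""
    P = np.maximum(np.asarray(P, dtype=float), 0.0)
    n = np.arange(1, len(P) + 1)
    return float(np.sum(n * P) / np.sum(P))
```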
The largest harmonic order with log-magnitude greater than zero is likewise obtained from the log-magnitude sequence P[n_h], with entries less than or equal to zero set to zero and indexing starting from 1;
calculation of the average reduction ratio of harmonic peaks and burrs requires that, in the case where the logarithmic spectrum of the FFT spectrum in step S2 is greater than zero, twenty non-harmonic peak maxima A [ n ] closest to the harmonic peak maximum in a harmonic spacing range centered on the harmonic peak maximum are obtained p ],n p Is A [ n ] p ]If the number of the maximum values in the range is less than twenty, taking all the indexes; if the number of maximum values is less than or equal to 10, directly taking an p ]The mean value of (2) is taken as the average amplitude of burrs around the harmonic wave crest; if the maximum number is greater than 10, then at an p ]Searching a maximum value group with the length of 10 continuous, wherein the FFT index span of the maximum value group comprises an index where the maximum value of the harmonic peak is located or no other maximum value exists between the index and the index where the maximum value of the harmonic peak is located and the sum is maximum; obtaining the average value of the maximum value group as the average amplitude of burrs around the harmonic peaks; let the average amplitude of the burrs around the harmonic peaks be a, the harmonic peak burr attenuation ratio PDR of the single harmonic is as follows:
Wherein p is the maximum amplitude of the current harmonic peak, and the average of the harmonic peak burr weakening ratios of a plurality of harmonics is obtained to obtain the average weakening ratio of the harmonic peak burrs;
let the logarithmic spectrum of the FFT spectrum of step S2 be mag [ k ]]HWI as determined in step S2 is mag [ k ]]The index sequence where the maximum value of the medium harmonic peak is located is called hi n for short h ]All indexes of the series start from 1, and the calculation formula of the slope bandwidth ratio SFWR between the 3 rd harmonic and the 4 th harmonic is as follows:
when calculating the bandwidth ratio decline rate of the inter-harmonic slope, firstly calculating the bandwidth ratio of the direct current component and the first-harmonic slope in the logarithmic spectrum as r 1 Calculating the inter-harmonic slope bandwidth ratio of each harmonic spacing between the 5 th harmonic and the 8 th harmonic and taking the average value as r 2 If the 8 th harmonic wave is less than, the highest harmonic wave is obtained; the calculation formula of the inter-harmonic slope bandwidth ratio degradation rate SFWRD is as follows:
calculating the harmonic peak-to-valley ratio based on the part with the logarithmic value larger than zero in the FFT logarithmic spectrum to make mag [ k ]]The elements less than or equal to zero are zero, and a new number array MAG [ k ] is obtained]The method comprises the steps of carrying out a first treatment on the surface of the For a harmonic peak, which has two left and right troughs, the left trough value is v p1 Its value is MAG [ k ]]A mean value within a trough range of the left harmonic spacing; the right wave valley value is v p2 Its value is MAG [ k ]]A mean value within a trough range of the right harmonic spacing; in the calculation of the peak-to-valley ratio of the M-order harmonic p1 The calculation formula of (2) is as follows:
wherein L is the left boundary of the trough value calculation and R is the right boundary of the trough value calculation;
The calculation formula of v_p2 in the M-th harmonic peak-to-valley ratio is as follows:
Knowing the logarithmic-magnitude sequence P[n_h] of the maximum point of each harmonic peak, the M-th harmonic peak-to-valley ratio MR is calculated as follows:
where P[M] represents the logarithmic magnitude of the maximum point of the M-th harmonic peak. The average of the harmonic peak-to-valley ratios over several harmonics is taken as the harmonic peak-to-valley ratio feature parameter;
The first harmonic peak-to-valley ratio is a special case of the M-th harmonic peak-to-valley ratio, and the steps are similar; in its calculation, v_p1 is calculated as follows:
In the calculation of the first harmonic peak-to-valley ratio, v_p2 is calculated as follows:
the first harmonic peak-to-valley ratio FHPVR is calculated as follows:
The calculation of the harmonic energy reduction rate distinguishes two cases, according to whether num, the highest harmonic order whose logarithmic value is greater than zero, is greater than four or not. In the first case, let head be the second-largest logarithmic amplitude among the first three harmonics, with harmonic order s, and let tail be the mean of those harmonics among the (num−2)-th, (num−1)-th and num-th whose logarithmic amplitude is greater than zero; the harmonic energy reduction rate HDR is then calculated as:
If the highest harmonic order num with logarithmic value greater than zero is less than or equal to four, head is the maximum logarithmic amplitude among the first num−1 harmonics, with corresponding harmonic order s, and tail is the logarithmic amplitude of the num-th harmonic; in this case HDR is calculated as:
In calculating the energy ratio of odd to even harmonics among the first 6 harmonics, denote the elements of the matrix HWB by HB_ij; the odd-to-even harmonic energy ratio R_oe6 of the first 6 harmonics is calculated as follows:
where the row and column indices of the matrix both start from 1, and wid is the row length of HB_ij;
For the ratio of the harmonic energy whose frequency-domain logarithmic value is greater than zero to the total energy, the elements of the matrix HWB that are less than or equal to zero are set to zero, giving a new matrix UPHB_ij; the ratio UHTR is calculated as follows:
where row and col are the numbers of rows and columns of UPHB_ij, respectively;
The calculation of the harmonic peak-valley difference-to-frequency-spacing ratio requires, for the M-th harmonic, the harmonic trough value v and the harmonic-peak maximum logarithmic value P[M]; the absolute value of their difference, divided by the frequency spacing between the harmonic trough and peak, is the M-th harmonic peak-valley difference-to-frequency-spacing ratio, and this feature is averaged over several harmonics to obtain the final value. Among the fifth to eighth harmonics in the logarithmic spectrum, those whose peak logarithmic value is greater than zero are calculated and averaged. A harmonic peak has left and right troughs, with values v_p1 and v_p2 computed exactly as in the M-th harmonic peak-to-valley ratio. The M-th harmonic peak-valley difference-to-frequency-spacing ratio DFR is then calculated as follows:
DFR is calculated for several harmonics and averaged to obtain the harmonic peak-valley difference-to-frequency-spacing ratio value;
The first harmonic peak-valley difference-to-frequency-spacing ratio FDFR is a special case of the M-th harmonic peak-valley difference-to-frequency-spacing ratio; its left and right trough values v_p1 and v_p2 are calculated as in the first harmonic peak-to-valley ratio. The calculation formula is as follows:
The 1st-to-2nd harmonic energy ratio is the ratio of the maximum logarithmic magnitude of the first harmonic to that of the second harmonic;
S402, extracting time-domain fine features, including the standard deviation, mean, median, mean-median distance, and per-unit-time mean-crossing count of the single-period fundamental-frequency sequence. Let the time-domain single-period fundamental-frequency sequence obtained in step S3 be F[n_T], of length Len, where n_T is the index of a time-domain single period. The mean f_mean of the single-period fundamental-frequency sequence is calculated as follows:
The standard deviation f_std of the single-period fundamental-frequency sequence is calculated as follows:
F[n_T] is sorted; the median f_median of the single-period fundamental-frequency sequence is calculated as follows:
The mean-median distance f_mm of the single-period fundamental-frequency sequence is calculated as follows:
f_mm = |f_mean − f_median|   formula (45)
The per-unit-time mean-crossing count Nc of the single-period fundamental-frequency sequence is calculated as follows:
where T_1 is the duration of the time-domain audio signal, and u(t) is defined as follows:
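As an illustration, the statistics of S402 can be computed directly from the single-period fundamental-frequency sequence. In this minimal numpy sketch, the mean-crossing count Nc is implemented as the number of sign changes of F[n_T] − f_mean divided by the signal duration T_1, which is one reading of the definition above; the function and variable names are illustrative:

```python
import numpy as np

def single_period_features(F, duration):
    """Statistics of the single-period fundamental-frequency sequence F:
    mean, standard deviation, median, |mean - median|, and mean-crossings
    per unit time (assumed interpretation of Nc)."""
    F = np.asarray(F, dtype=float)
    f_mean = F.mean()
    f_std = F.std()
    f_median = np.median(F)
    f_mm = abs(f_mean - f_median)
    # count sign changes of (F - mean), i.e. crossings of the mean level
    signs = np.sign(F - f_mean)
    crossings = np.count_nonzero(np.diff(signs))
    Nc = crossings / duration
    return f_mean, f_std, f_median, f_mm, Nc
```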
S403, extracting the 13-dimensional static Mel-frequency cepstral coefficient features: first, the input audio data are preprocessed; second, an FFT is applied to the audio and the squared FFT magnitudes give the energy spectrum; the spectrum is then filtered by a Mel filter bank to obtain the energy output of each filter; finally, the logarithm of the filter outputs is taken and a discrete cosine transform (DCT) yields the Mel-frequency cepstral coefficient features.
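The MFCC pipeline of S403 can be condensed into a short numpy sketch for a single frame (pre-processing such as pre-emphasis and windowing is omitted; the 26-filter Mel bank, 512-point FFT and unnormalized DCT-II are common defaults assumed here, not values fixed by the text):

```python
import numpy as np

def mfcc_frame(frame, fs, n_mels=26, n_ceps=13, n_fft=512):
    """13-dimensional static MFCC of one audio frame: FFT -> energy
    spectrum -> Mel filter bank -> log -> DCT (type II)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2        # energy spectrum
    # Mel filter bank: triangular filters equally spaced on the Mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logE = np.log(fbank @ power + 1e-10)                  # log filter outputs
    # DCT-II of the log energies, keeping the first n_ceps coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return basis @ logE
```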
Further, in the step S5, the characteristic correlation and the coefficient of variation are evaluated, and the process of establishing the random forest model to classify the musical instruments is as follows:
s501, calculating the Pearson correlation coefficient among the elements of the feature vector and evaluating the feature effectiveness;
s502, calculating variation coefficients of elements of the feature vector and evaluating feature effectiveness;
s503, fusing the time-frequency fine features and the mel-frequency cepstrum coefficient features into new feature vectors, and sending the fused features into a random forest model for training, wherein five-fold cross verification is adopted during training;
The random forest model is built as follows: the training set is randomly sampled several times with the bootstrap method, each sampled training set being denoted θ;
For each training sample set θ, a corresponding decision tree is built. In each constructed decision tree, q features are randomly selected from the Q features of each training sample θ for splitting, where q ≤ Q and ⌊·⌋ denotes the round-down (floor) operation; the Gini impurity of the selected q features is computed, and the feature with the minimum Gini impurity is taken as the optimal splitting feature. According to the selected optimal splitting feature, the node is split into two parts, and a sub-optimal splitting feature is selected from the remaining features; this is repeated until the number of remaining features of θ is 0 or the classification performance of the current decision tree is optimal;
All 30 decision trees are combined to obtain the random forest model, and the final classification result is obtained by voting.
The random forest model can process input samples with high-dimensional features without dimensionality reduction; its basic unit is the decision tree, and by combining many weak classifiers and obtaining the final result through voting or averaging, the model as a whole achieves higher accuracy and generalization performance.
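The training procedure of S503 can be sketched with scikit-learn (a minimal illustration assuming scikit-learn; the feature matrix and labels below are synthetic placeholders standing in for the fused time-frequency fine features and 13-dimensional MFCC features, and `max_features="log2"` mirrors the floor-of-log feature-subset selection described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))    # placeholder fused feature vectors
y = rng.integers(0, 4, size=200)  # placeholder labels for 4 instrument classes

# 30 trees with bootstrap sampling and random per-split feature subsets,
# evaluated with five-fold cross-validation as in the text
clf = RandomForestClassifier(n_estimators=30, max_features="log2", random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```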
Compared with the prior art, the invention has the following advantages and effects:
(1) Compared with the traditional coarse-grained description of frequency-domain features, the invention focuses on describing the fine structure of the spectrum and provides several indices characterizing fine features such as spectral harmonic peaks, burrs and harmonic troughs, so the spectral structure that determines timbre can be characterized more accurately. The frequency-domain fine features further capture the spectral nuances caused by the playing and excitation characteristics of the instrument, the characteristics of its resonant cavity, and so on. For example, the logarithmic spectrograms of the trumpet and the violin show burrs around the harmonic peak maxima whose amplitudes are relatively large compared with the peak maxima, whereas the harmonic peaks in the logarithmic spectrogram of the piano are very sharp and show no such burrs. The frequency-domain fine features describe these spectral nuances more accurately than traditional frequency-domain features.
(2) Compared with traditional time-domain features, the time-domain fine features proposed by the invention accurately mark the intervals of single fundamental periods in the time domain, record the fundamental-frequency variation sequence, and emphasize the description of fine fundamental-frequency changes. Different instruments differ in the fundamental-frequency stability of monophonic audio: for example, the fundamental frequency of a trumpet, controlled by lip vibration, is less stable than that of a piano string vibrating freely after being struck. The proposed time-domain fine features accurately characterize this difference, which traditional time-domain features do not capture.
(3) The instrument classification error rate of the random forest model fed with the fused features is reduced by 5.7 percent compared with that of a random forest model using only the 13-dimensional static Mel-frequency cepstral coefficient features.
(4) The application designs a cyclically iterative fundamental-frequency estimation algorithm based on an autocorrelation algorithm and narrowband spectral-energy frequency estimation, achieving accurate fundamental-frequency estimation at low signal-to-noise ratio. Based on the estimated fundamental frequency, the application designs a harmonic marking algorithm based on adaptive window search and a time-domain single-period fundamental-frequency sequence estimation algorithm, which accurately mark the harmonic positions of the spectrum and acquire the single-period fundamental-frequency sequence. On the one hand, this overcomes the problem of harmonic deviation during harmonic marking and achieves accurate marking of the harmonics; on the other hand, the single-period fundamental-frequency sequence, being based on consecutive time-domain single periods, better represents the continuous variation of the time-domain fundamental frequency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a musical instrument classification method based on time-frequency fine analysis in an embodiment of the present application;
FIG. 2 is a flowchart of a cyclic iterative fundamental frequency estimation algorithm based on an autocorrelation algorithm and a narrowband spectral energy frequency estimation algorithm in an embodiment of the invention;
FIG. 3 is a flow chart of a harmonic tagging algorithm based on adaptive window searching in an embodiment of the present invention;
fig. 4 is a flowchart of a time domain single period baseband sequence estimation algorithm in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Fig. 1 is a flowchart of a musical instrument classification method based on time-frequency fine analysis according to an embodiment of the present invention.
S1, inputting single-tone audio of a musical instrument, and accurately estimating fundamental frequency of the audio;
FIG. 2 is a flowchart of a cyclic iterative fundamental frequency estimation algorithm based on an autocorrelation algorithm and a narrowband spectral energy frequency estimation algorithm in an embodiment of the invention;
S101, detecting a single-tone audio end point of a musical instrument by adopting a short-time energy method, intercepting effective audio, and calculating the short-time energy by the following formula:
where s[n] is the input monophonic instrument audio signal, w[n] is the window function, n is the time-domain index of s[n] and w[n], N_f is the frame length, z is the shift of the window function, and E_z is the short-time energy at shift z. The values E_z for different shifts form the short-time energy sequence. Thresholds are set on the short-time energy for the start and end points of the valid audio: when the short-time energy of the current frame exceeds the start threshold, the midpoint of that frame is recorded as the start point of the valid audio; when it falls below the end threshold, the midpoint of that frame is recorded as the end point;
In this embodiment, mainly the start point of the monophonic audio signal is detected; the short-time energy threshold used is 0.3 times the maximum short-time energy over all audio frames;
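Step S101 can be sketched as follows; the rectangular window, the frame parameters and the 0.3-times maximum-energy threshold of this embodiment are assumptions of the sketch, and the names are illustrative:

```python
import numpy as np

def short_time_energy_endpoints(s, frame_len=1024, hop=512, ratio=0.3):
    """Short-time energy endpoint detection (step S101): return the
    (start, end) sample indices of the valid audio, thresholding at
    `ratio` times the maximum frame energy (0.3 in this embodiment)."""
    n_frames = 1 + max(0, (len(s) - frame_len) // hop)
    energy = np.array([np.sum(s[i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    thresh = ratio * energy.max()
    above = np.where(energy >= thresh)[0]
    # record the midpoint of the first/last frame crossing the threshold
    start = above[0] * hop + frame_len // 2
    end = above[-1] * hop + frame_len // 2
    return start, end
```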
S102, roughly estimate the fundamental frequency f_0 of the valid audio x[n] by time-domain autocorrelation; the discrete-signal autocorrelation formula is as follows:
where τ is the lag, i.e., the shift: x[n] is shifted left by τ units and multiplied by the original signal, and the products are summed to give the autocorrelation value at the current lag. The autocorrelation value is maximal at zero lag. From the relation between the autocorrelation value and the lag, the maximum point other than the zero-lag point is found; the lag difference between this point and the zero-lag point is a rough estimate of the fundamental period, from which the fundamental frequency f_0 is estimated;
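The coarse estimate of S102 can be sketched as follows, restricting the lag search to a plausible pitch range (the 50–2000 Hz bounds are assumptions of the sketch, not values from the text):

```python
import numpy as np

def autocorr_f0(x, fs, f_min=50.0, f_max=2000.0):
    """Rough fundamental-frequency estimate by time-domain autocorrelation
    (step S102): the lag of the largest autocorrelation peak away from
    zero lag approximates the fundamental period."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # r[tau] for tau >= 0
    lo, hi = int(fs / f_max), int(fs / f_min)         # admissible lag range
    tau = lo + int(np.argmax(r[lo:hi]))               # best lag in range
    return fs / tau
```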
S103, perform a fast Fourier transform (Fast Fourier Transform, FFT) on the audio signal and record the frequency f_max corresponding to the current FFT maximum. According to the time-domain autocorrelation result, judge whether f_max is an interference component; if so, remove that component from the FFT spectrum and re-acquire the frequency f_max of the FFT maximum; repeat this process until f_max is judged not to be an interference component;
In this embodiment, 4096 FFT points are used when the autocorrelation-estimated fundamental frequency f_0 is below 400 Hz, and 2048 FFT points when it is 400 Hz or above;
S104, use the narrowband spectral-energy frequency estimate to compute a denoised frequency value f_c for the frequency of the maximum FFT peak, suppressing noise interference; with the FFT spectrum array indexed from 1, the formula for f_c is as follows:
where k_0 is the FFT index of the maximum FFT peak, k is the index into the FFT spectrum, |X_k| is the FFT magnitude of the k-th point, F_s is the sampling rate of the audio signal, and N is the number of FFT points;
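One plausible realization of the narrowband spectral-energy frequency estimate of S104, consistent with the quantities k_0, |X_k|, F_s and N defined above, is an energy-weighted centroid of the FFT bins around the peak; the ±3-bin half-width in this sketch is an assumption:

```python
import numpy as np

def narrowband_freq_estimate(X_mag, k0, fs, N, half_width=3):
    """Denoised frequency estimate f_c: energy-weighted centroid of the
    FFT bins in a narrow band around the peak bin k0 (1-based indexing,
    as in the text). The +/-3-bin band width is an assumption."""
    ks = np.arange(max(1, k0 - half_width), min(N // 2, k0 + half_width) + 1)
    w = X_mag[ks - 1] ** 2                 # bin energies
    k_c = np.sum(ks * w) / np.sum(w)       # energy-weighted bin centroid
    return (k_c - 1) * fs / N              # convert bin index to Hz
```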
S105, from the denoised frequency value f_c and the autocorrelation-estimated fundamental frequency f_0, compute the approximate harmonic order n_m at which the maximum peak lies, where:
the fundamental frequency estimate f_1 is then obtained as:
S106, resample the audio time-domain signal according to the fundamental frequency estimate f_1; the new sampling rate f_s is calculated as:
f_s = f_1 × 2^m   formula (6)
where the value of m is as follows:
This ensures that the new sampling rate is not so large that the N sampling points fail to cover enough fundamental periods. The resampled time-domain signal is processed again from step S102, until the change of the fundamental frequency estimate f_1 relative to the previous iteration is smaller than 0.001 or the number of iterations reaches the limit of 50; the final fundamental frequency estimate f_1 is then output.
S2, carrying out FFT frequency domain harmonic marking according to the size of the fundamental frequency of the audio;
FIG. 3 is a flow of an adaptive window search based harmonic tagging algorithm in an embodiment of the invention;
s201, detecting an audio endpoint of the musical instrument by adopting a short-time energy method, and intercepting an effective audio part; in this example, the detection of the starting point of the monophonic audio signal is mainly performed, and the short-time energy threshold value adopted is 0.3 times of the maximum value of the short-time energy of all the audio frames;
S202, resample the audio time-domain signal according to the fundamental frequency estimate f_1 of step S1 so that the audio sampling rate is a power-of-two multiple of f_1, reducing spectrum leakage, and convert the FFT spectrum to a logarithmic spectrum; in this embodiment, since fine spectrum analysis is required, 16384 FFT points are used for harmonic marking.
S203, determine the left endpoint l and the right endpoint r of the search range for the next harmonic peak on the FFT spectrum; for the first 3 harmonics the calculation formula is as follows:
where P is the index of the position of the previous harmonic peak in the FFT spectrum; when marking the position of the first harmonic peak, P = 1;
For the 4th and later harmonics, the left endpoint l and right endpoint r are calculated as follows:
l = P + bT − 2   formula (10)
r = P + 1.09·T_1   formula (11)
where bT is half the distance, on the FFT spectrum, between the third and first harmonic peaks, and T_1 is the FFT-point spacing of the two marked harmonics nearest the currently marked harmonic;
S204, find all maximum points within the search range and take the largest of them as the peak point of the current harmonic. If no maximum point exists in the search range, both boundaries of the search range are expanded when marking the first four harmonics, and only the right boundary for the other harmonics. When expanding the search boundaries of the first two harmonics, the left and right endpoints l and r satisfy the following conditions:
When expanding the search boundaries of the 3rd and 4th harmonics, the left and right endpoints l and r satisfy the following conditions:
l − P ≥ 0.7·T_1   formula (14)
r − P ≤ 1.3·T_1   formula (15)
When the search boundary of the 5th and higher harmonics is expanded, the right endpoint r satisfies the following condition:
r − P ≤ 1.3·T_1   formula (16)
Then the largest of all maximum points within the new search range is taken as the peak point of the current harmonic; the FFT logarithmic magnitudes and index values of all maximum points within a 0.1-harmonic-spacing interval centered on this point are recorded and stored in the matrix HWB and the sequence HWI, respectively, each row of HWB storing the logarithmic value of one harmonic peak together with the nearby spectral logarithmic values. If no maximum point is found, the search range is expanded further and the procedure repeated.
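At its core, the peak search of S203–S204 reduces to locating the largest local maximum within [l, r]; a simplified sketch follows (the boundary-expansion rules and the HWB/HWI bookkeeping of the full algorithm are omitted):

```python
import numpy as np

def find_harmonic_peak(log_mag, l, r):
    """Return the index of the largest local maximum of log_mag within
    [l, r], or None if the range contains no local maximum (in which
    case the caller expands the search boundaries)."""
    seg = log_mag[l:r + 1]
    # local maxima: strictly greater than both neighbours
    is_max = (seg[1:-1] > seg[:-2]) & (seg[1:-1] > seg[2:])
    idx = np.where(is_max)[0] + 1
    if idx.size == 0:
        return None
    best = idx[np.argmax(seg[idx])]
    return l + best
```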
S3, estimating a time domain single period fundamental frequency sequence according to the fundamental frequency of the audio frequency, wherein the process comprises the following steps:
FIG. 4 is a flow chart of a time domain single period baseband sequence estimation algorithm in an embodiment of the invention;
s301, detecting an audio endpoint of a musical instrument by adopting a short-time energy method, and intercepting an effective audio part; in this example, the detection of the starting point of the monophonic audio signal is mainly performed, and the short-time energy threshold value adopted is 0.3 times of the maximum value of the short-time energy of all the audio frames;
S302, filter out low-frequency interference with a high-pass filter; in this embodiment the cut-off frequency is set to 0.3 times the fundamental frequency estimate f_1 of step S1.
S303, mark the boundary points of single fundamental periods using a cross-correlation operation. Taking a point o in the middle of the audio signal as the origin, extend equally to both sides, the extension distance being the single-fundamental-period length obtained in the previous operation; the signal within this range is the shift vector, whose length is twice the single-period length of the previous operation, which gives the method its adaptive character. For the first cross-correlation operation, the single-period length is taken from the fundamental-frequency estimation result of step S2. With o as origin, a signal of four single-fundamental-period lengths to the left or right is taken as the fixed vector, the single-period length again being that of the previous operation. The shift vector is slid as a window over the fixed vector in a cross-correlation operation, giving the relation between the cross-correlation value and the relative displacement; from this relation, the sequence of extreme points of the cross-correlation values and the corresponding relative-displacement sequence are recorded. A reasonable value is selected from the relative-displacement sequence, and the next single-fundamental-period boundary point is computed with o as reference, the distance between the boundary point and o being the single-fundamental-period length. Repeating this cycle yields the time-domain single-period fundamental-frequency sequence. The cross-correlation r[τ] is calculated as follows:
where τ is the lag, i.e., the shift; g[n] and v[n] are two different signals; v[n] is shifted left by τ units and multiplied by g[n], and the products are summed, giving the similarity at the current lag.
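The cross-correlation of formula (17) can be sketched directly (g, v and max_lag are illustrative names):

```python
import numpy as np

def cross_correlation(g, v, max_lag):
    """Cross-correlation r[tau] as in formula (17): v is shifted left by
    tau units, multiplied by g, and the products are summed."""
    n = min(len(g), len(v))
    r = np.zeros(max_lag + 1)
    for tau in range(max_lag + 1):
        m = n - tau
        r[tau] = np.sum(g[:m] * v[tau:tau + m])  # sum of g[n] * v[n + tau]
    return r
```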
S4, extracting time-frequency fine features and Mel frequency cepstrum coefficient features, wherein the process is as follows:
S401, extracting frequency-domain fine features, including: the log-value-greater-than-zero harmonic energy centroid, the average reduction ratio of harmonic-peak burrs, the maximum harmonic order with logarithmic value greater than zero, the slope-bandwidth ratio between two harmonics, the harmonic peak-to-valley ratio, the harmonic energy reduction rate, the odd-to-even harmonic energy ratio of the first 6 harmonics, the first harmonic peak-to-valley ratio, the decline rate of the inter-harmonic slope-bandwidth ratio, the ratio of the harmonic energy with frequency-domain logarithmic value greater than zero to the total energy, the harmonic peak-valley difference-to-frequency-spacing ratio, and the first harmonic peak-valley difference-to-frequency-spacing ratio;
From the HWB obtained in step S2, the logarithmic-amplitude sequence P[n_h] of the maximum point of each harmonic peak can be obtained; the entries of the logarithmic-amplitude sequence that are less than or equal to zero are set to zero, and indexing starts from 1. The log-value-greater-than-zero harmonic energy centroid c is calculated as follows:
where L_P is the length of P[n_h], which is also the maximum harmonic order, and n_h is the current harmonic order, i.e. the index of P[n_h];
The maximum harmonic order with logarithmic value greater than zero is obtained from the logarithmic-amplitude sequence P[n_h], with entries less than or equal to zero set to zero and indexing starting from 1;
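A sketch of the log-value-greater-than-zero harmonic energy centroid, under the assumption that c is the amplitude-weighted mean of the harmonic order n_h (the standard centroid form, consistent with the description above):

```python
import numpy as np

def harmonic_energy_centroid(P):
    """Log-value-greater-than-zero harmonic energy centroid: entries of
    the peak log-amplitude sequence P[n_h] that are <= 0 are zeroed, and
    the amplitude-weighted mean harmonic order is returned (assumed form)."""
    P = np.maximum(np.asarray(P, dtype=float), 0.0)
    n = np.arange(1, len(P) + 1)          # harmonic orders, 1-based
    return np.sum(n * P) / np.sum(P)
```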
The calculation of the average reduction ratio of harmonic-peak burrs requires, on the portion of the logarithmic FFT spectrum of step S2 that is greater than zero, obtaining the twenty non-harmonic-peak maxima A[n_p] closest to the harmonic peak maximum within one harmonic-spacing range centered on that maximum, where n_p is the index of A[n_p]; if fewer than twenty maxima lie in the range, all of them are taken. If the number of maxima is less than or equal to 10, the mean of A[n_p] is taken directly as the average burr amplitude around the harmonic peak; if it is greater than 10, a group of 10 consecutive maxima is sought in A[n_p] whose FFT index span contains the index of the harmonic peak maximum, or between which and that index no other maximum lies, and whose sum is maximal; the mean of this group is taken as the average burr amplitude around the harmonic peak. Let the average burr amplitude around the harmonic peak be a; the harmonic-peak burr reduction ratio PDR of a single harmonic is then:
where p is the maximum amplitude of the current harmonic peak; averaging the harmonic-peak burr reduction ratios of several harmonics yields the average reduction ratio of harmonic-peak burrs;
Let the logarithmic spectrum of the FFT spectrum of step S2 be mag[k], and let the index sequence of the harmonic-peak maxima in mag[k], determined by HWI in step S2, be abbreviated hi[n_h]; all indices start from 1. The slope-bandwidth ratio SFWR between the 3rd and 4th harmonics is calculated as follows:
When calculating the decline rate of the inter-harmonic slope-bandwidth ratio, first calculate the slope-bandwidth ratio r_1 between the direct-current component and the first harmonic in the logarithmic spectrum; then calculate the inter-harmonic slope-bandwidth ratio for each harmonic spacing between the 5th and 8th harmonics and take the mean as r_2 (if fewer than 8 harmonics exist, compute up to the highest harmonic). The decline rate SFWRD of the inter-harmonic slope-bandwidth ratio is calculated as follows:
The harmonic peak-to-valley ratio is calculated on the portion of the logarithmic FFT spectrum whose logarithmic value is greater than zero: the elements of mag[k] that are less than or equal to zero are set to zero, giving a new array MAG[k]. A harmonic peak has left and right troughs; the left trough value v_p1 is the mean of MAG[k] within the trough range one harmonic spacing to the left, and the right trough value v_p2 is the mean of MAG[k] within the trough range one harmonic spacing to the right. In the calculation of the M-th harmonic peak-to-valley ratio, v_p1 is calculated as follows:
wherein L is the left boundary of the trough value calculation and R is the right boundary of the trough value calculation;
The calculation formula of v_p2 in the M-th harmonic peak-to-valley ratio is as follows:
Knowing the logarithmic-magnitude sequence P[n_h] of the maximum point of each harmonic peak, the M-th harmonic peak-to-valley ratio MR is calculated as follows:
where P[M] represents the logarithmic magnitude of the maximum point of the M-th harmonic peak. The average of the harmonic peak-to-valley ratios over several harmonics is taken as the harmonic peak-to-valley ratio feature parameter;
The first harmonic peak-to-valley ratio is a special case of the M-th harmonic peak-to-valley ratio, and the steps are similar; in its calculation, v_p1 is calculated as follows:
In the calculation of the first harmonic peak-to-valley ratio, v_p2 is calculated as follows:
the first harmonic peak-to-valley ratio FHPVR is calculated as follows:
The calculation of the harmonic energy reduction rate distinguishes two cases, according to whether num, the highest harmonic order whose logarithmic value is greater than zero, is greater than four or not. In the first case, let head be the second-largest logarithmic amplitude among the first three harmonics, with harmonic order s, and let tail be the mean of those harmonics among the (num−2)-th, (num−1)-th and num-th whose logarithmic amplitude is greater than zero; the harmonic energy reduction rate HDR is then calculated as:
If the highest harmonic order num with logarithmic value greater than zero is less than or equal to four, head is the maximum logarithmic amplitude among the first num−1 harmonics, with corresponding harmonic order s, and tail is the logarithmic amplitude of the num-th harmonic; in this case HDR is calculated as:
in the calculation of the energy ratio of odd to even harmonics of the first 6 harmonics, let the elements of the matrix HWB be HB_ij; the odd-to-even harmonic energy ratio R_oe6 of the first 6 harmonics is calculated as follows:
where the row and column indices of the matrix start from 1, and wid is the length of a row of HB_ij;
for the ratio of harmonic energy with frequency-domain logarithmic value greater than zero to total energy, the elements of the matrix HWB that are less than or equal to zero are set to zero to obtain a new matrix UPHB_ij; this ratio UHTR is calculated as follows:
where row and col are the numbers of rows and columns of the matrix UPHB_ij, respectively;
the calculation of the harmonic peak-valley difference ratio frequency spacing requires, for the Mth harmonic, the absolute difference between the harmonic trough value v and the harmonic peak maximum logarithmic value P[M]; this difference divided by the frequency spacing between the harmonic trough and peak is the M-order harmonic peak-valley difference ratio frequency spacing, and this characteristic value is averaged over several harmonics to obtain the final value; the harmonics with harmonic peak logarithmic value greater than zero among the 5th to 8th harmonics of the logarithmic spectrum are used; a harmonic peak has two troughs, left and right, with left trough value v_p1 and right trough value v_p2, computed with the same algorithm as in the M-order harmonic peak-to-valley ratio; the Mth harmonic peak-valley difference ratio frequency spacing DFR is calculated as follows:
DFR is calculated for several harmonics and averaged to obtain the harmonic peak-valley difference ratio frequency spacing value;
the first harmonic peak-valley difference ratio frequency spacing FDFR is a special case of the M-order harmonic peak-valley difference ratio frequency spacing; its left and right trough values v_p1 and v_p2 are those from the first harmonic peak-to-valley ratio calculation; the calculation formula is as follows:
the 1 st and 2 nd harmonic energy ratio is the ratio of the maximum logarithmic magnitude of the first harmonic to the maximum logarithmic magnitude of the second harmonic;
s402, extracting time domain fine features, including the single-period fundamental frequency sequence standard deviation, mean, median, mean-median distance, and number of mean crossings per unit time; let the time domain single-period fundamental frequency sequence obtained in step S3 be F[n_T], let its length be Len, and let n_T denote the index of the time domain single period; the single-period fundamental frequency sequence mean f_mean is calculated as follows:
the single-period fundamental frequency sequence standard deviation f_std is calculated as follows:
after sorting F[n_T], the single-period fundamental frequency sequence median f_median is calculated as follows:
the single-period fundamental frequency sequence mean-median distance f_mm is calculated as follows:
f_mm = |f_mean - f_median|   Formula (45)
the number of mean crossings per unit time of the single-period fundamental frequency sequence, N_c, is calculated as follows:
where T_1 is the duration of the time-domain audio signal and u(t) is expressed as follows:
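The S402 statistics above are standard; a sketch under the assumptions that the standard deviation uses 1/Len normalisation and that N_c counts crossings of the mean level (the role of the step function u(t)) per unit time:

```python
import numpy as np

def single_period_features(F, T1):
    F = np.asarray(F, float)
    f_mean = F.mean()
    f_std = F.std()                      # 1/Len normalisation assumed
    f_median = float(np.median(np.sort(F)))
    f_mm = abs(f_mean - f_median)        # Formula (45)
    # count crossings of the mean level, per unit time
    above = (F > f_mean).astype(int)
    N_c = np.count_nonzero(np.diff(above)) / T1
    return f_mean, f_std, f_median, f_mm, N_c
```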
s403, extracting the static features of the 13-dimensional mel-frequency cepstral coefficients: first, the input audio data are preprocessed; second, an FFT is applied to the audio, and the squared FFT magnitudes give the energy spectrum of the audio; the spectrum is then filtered by a mel filter bank to obtain the energy output of each filter; finally, the logarithm of the filter outputs is taken and a discrete cosine transform (Discrete Cosine Transform, DCT) yields the mel-frequency cepstral coefficient features.
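A compact sketch of the S403 pipeline (power spectrum, triangular mel filter bank, log, DCT, first 13 coefficients); the FFT size, number of filters, and single-frame treatment are illustrative choices, not the patent's.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_13(signal, fs, n_fft=512, n_filt=26):
    # energy spectrum: squared FFT magnitudes
    spec = np.abs(np.fft.rfft(signal, n_fft)) ** 2
    # triangular mel filter bank between 0 Hz and fs/2
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    # per-filter energies, log, then DCT; keep the first 13 coefficients
    energies = np.maximum(fbank @ spec, 1e-12)
    return dct(np.log(energies), type=2, norm='ortho')[:13]
```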
S5, based on the extracted time-frequency fine features and the mel-frequency cepstral coefficient features, performing feature correlation and coefficient-of-variation evaluation, and establishing a random forest model to classify musical instruments as follows:
s501, calculating the Pearson correlation coefficient among the elements of the feature vector and evaluating the feature effectiveness;
s502, calculating variation coefficients of elements of the feature vector and evaluating feature effectiveness;
s503, fusing the time-frequency fine features and the mel-frequency cepstral coefficient features into a new feature vector; the fused features of 822 audio recordings of 5 instruments are input into a random forest model for training, with five-fold cross-validation during training; the 13-dimensional mel-frequency cepstral coefficient static features alone are then input into a random forest model for training, again with five-fold cross-validation; a random forest can handle input samples with high-dimensional features without dimensionality reduction; its basic unit is the decision tree, and the final result is obtained by voting or averaging over a combination of weak classifiers, giving the overall model higher accuracy and generalization; in this example, the final instrument classification error rate of the random forest model fed the fused features was 6% lower than that of the random forest model using only the 13-dimensional mel-frequency cepstral coefficient static features, as shown in Table 1.
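The S503 training scheme can be sketched with scikit-learn; the data below are random stand-ins, not the 822 recordings, and the feature dimensions are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_tf = rng.normal(size=(200, 20))    # stand-in: time-frequency fine features
X_mfcc = rng.normal(size=(200, 13))  # stand-in: 13-dim MFCC static features
y = rng.integers(0, 5, size=200)     # 5 instrument classes

# feature fusion by concatenation, then five-fold cross-validation
X_fused = np.hstack([X_tf, X_mfcc])
clf = RandomForestClassifier(n_estimators=30, random_state=0)
scores = cross_val_score(clf, X_fused, y, cv=5)
```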
Table 1. Feature classes and classification accuracy
Example 2
Based on the musical instrument classification method based on the time-frequency fine analysis disclosed in the above embodiment 1, the present embodiment continues to provide a classification implementation process of the musical instrument classification method based on the time-frequency fine analysis, which specifically includes the following steps:
s1, referring to the corresponding steps in embodiment 1, the fundamental frequency f_1 of the input violin single-tone audio is obtained;
S2, recording all FFT logarithmic magnitudes and index values of the maximum points within 0.1 harmonic spacings centered on the maximum point of each harmonic, storing them in a matrix HWB and a sequence HWI respectively, each row of HWB storing a harmonic peak logarithmic value and the nearby spectral logarithmic values;
s3, referring to the corresponding steps in the embodiment 1, the description is omitted here;
s4, extracting the time-frequency fine features and the 13-dimensional mel-frequency cepstral coefficient static features; each feature is computed automatically with MATLAB code, giving the feature values of the audio shown in the table below.
Table 2. Time-frequency fine features and 13-dimensional mel-frequency cepstral coefficient static features of the corresponding audio
S5, inputting the above features into the fused-feature random forest model of embodiment 1; the model classifies the audio as a violin, which is correct.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included in the protection scope of the present invention.
Claims (6)
1. A musical instrument classification method based on time-frequency fine analysis, characterized in that the musical instrument classification method comprises the following steps:
s1, inputting single-tone audio of a musical instrument, and accurately estimating fundamental frequency of the audio;
s2, carrying out FFT frequency domain harmonic marking according to the size of the fundamental frequency of the audio;
s3, estimating a time domain single-period fundamental frequency sequence according to the fundamental frequency of the audio;
s4, extracting time-frequency fine features and Mel frequency cepstrum coefficient features;
and S5, based on the extracted time-frequency fine features and the mel-frequency cepstral coefficient features, evaluating feature correlation and coefficients of variation, and establishing a random forest model to classify musical instruments.
2. The musical instrument classification method based on time-frequency fine analysis according to claim 1, wherein the step S1 process is as follows:
s101, detecting a single-tone audio end point of a musical instrument by adopting a short-time energy method, intercepting effective audio, and calculating the short-time energy by the following formula:
where s[n] is the input instrument single-tone audio signal, w[n] is a window function, n is the time-domain index of s[n] and w[n], N_f is the frame length of a signal frame, z is the shift of the window function, and E_z is the short-time energy at shift z; the E_z values for different shifts form a short-time energy sequence; thresholds are set on the short-time energy for the start and end points of the effective audio of the instrument single-tone audio: when the short-time energy of the current frame exceeds the start threshold, the midpoint of the frame is recorded as the start point of the effective audio, and when the short-time energy of the current frame falls below the end threshold, the midpoint of the frame is recorded as the end point of the effective audio;
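A sketch of the S101 short-time energy endpoint detector; the Hamming window, hop size, and thresholds are illustrative assumptions, not the patent's.

```python
import numpy as np

def short_time_energy(s, frame_len, hop):
    # E_z = sum over one frame of (s[n] * w[n - z])^2
    w = np.hamming(frame_len)  # window choice is an assumption
    starts = range(0, len(s) - frame_len + 1, hop)
    return np.array([np.sum((s[z:z + frame_len] * w) ** 2) for z in starts])

def energy_endpoints(E, start_thr, end_thr, frame_len, hop):
    start_frame = int(np.nonzero(E > start_thr)[0][0])
    end_frame = int(np.nonzero(E > end_thr)[0][-1])
    # record the midpoint of the triggering frame, in samples
    mid = frame_len // 2
    return start_frame * hop + mid, end_frame * hop + mid
```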
s102, roughly estimating the fundamental frequency f_0 of the effective audio x[n] by time-domain autocorrelation; the discrete-signal autocorrelation formula is as follows:
where τ denotes the lag, i.e. the translation amount; x[n] is shifted left by τ units, multiplied by the original signal, and the products are summed to give the autocorrelation value at the current lag; the autocorrelation value is maximal at zero lag; from the relation between the autocorrelation value and the lag, the maximum point other than zero lag is found, and its lag relative to zero is an approximate estimate of the fundamental period, from which the fundamental frequency f_0 is estimated;
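A sketch of the S102 autocorrelation estimate of f_0; the way the zero-lag main lobe is skipped (starting the peak search at the first negative autocorrelation value) and the minimum-frequency bound are common heuristics, not taken from the patent.

```python
import numpy as np

def estimate_f0_autocorr(x, fs, fmin=50.0):
    # r[tau] = sum_n x[n] * x[n + tau] for tau >= 0
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    # skip the main lobe around tau = 0: start at the first negative value
    neg = np.nonzero(r < 0)[0]
    start = int(neg[0]) if len(neg) else 1
    tau_max = int(fs / fmin)
    tau = start + int(np.argmax(r[start:tau_max]))  # lag of the largest peak
    return fs / tau
```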
S103, performing a fast Fourier transform (FFT) on the audio signal and recording the frequency f_max corresponding to the current FFT maximum; judging from the time-domain autocorrelation result whether f_max is an interference component; if so, removing that component from the FFT spectrum and re-acquiring the frequency f_max corresponding to the FFT maximum, repeating this process until the time-domain autocorrelation result indicates that f_max is not an interference component;
s104, calculating a denoised frequency value f_c for the frequency of the maximum FFT peak using narrowband spectral energy frequency estimation, eliminating noise interference; with the index of the FFT spectrum array starting at 1, the denoised frequency value f_c of the FFT spectrum is calculated as follows:
where k_0 is the FFT index corresponding to the maximum FFT peak, k is the index into the FFT spectrum, |X_k| is the FFT magnitude at the kth point of the FFT spectrum, F_s is the sampling rate of the audio signal, and N is the number of FFT points;
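The formula for f_c is an unreproduced image; the sketch below computes an energy-weighted mean frequency in a narrow band around the maximum FFT peak, which matches the written description, with the band half-width as an assumption (0-based bins are used, equivalent to the text's 1-based convention).

```python
import numpy as np

def narrowband_freq(x, fs, half_width=3):
    # energy-weighted mean frequency in a narrow band around the FFT maximum
    X = np.abs(np.fft.rfft(x))
    N = len(x)
    k0 = int(np.argmax(X))
    ks = np.arange(max(k0 - half_width, 0), min(k0 + half_width + 1, len(X)))
    w = X[ks] ** 2                      # spectral energy as the weight
    return float(np.sum(ks * w) / np.sum(w)) * fs / N
```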
s105, from the denoised frequency value f_c and the autocorrelation-estimated fundamental frequency f_0, calculating the approximate harmonic order n_m at which the maximum peak lies, where:
the fundamental frequency estimate f_1 is then obtained as:
S106, resampling the audio time-domain signal according to the fundamental frequency estimate f_1; the new sampling rate f_s is calculated as:
f_s = f_1 × 2^m   Formula (6)
Wherein the value of m is as follows:
re-executing the re-sampled time domain signal from step S102 until the fundamental frequency estimated value f 1 Compared with the estimated value of the fundamental frequency in the last cycle, the variation amplitude is smaller than 0.001 or the cycle number reaches 50 times of limit, and the estimated value f of the fundamental frequency in the last time is output 1 。
3. The method according to claim 1, wherein in the step S2, the FFT frequency domain harmonic labeling process according to the size of the fundamental frequency of the audio is as follows:
s201, detecting an audio endpoint of the musical instrument by adopting a short-time energy method, and intercepting an effective audio part;
s202, resampling the audio time-domain signal according to the fundamental frequency estimate f_1 from step S1 so that the audio sampling rate is f_1 times a power of two, and converting the FFT spectrum to a logarithmic spectrum;
s203, determining a left endpoint l and a right endpoint r of the next harmonic peak search range on the FFT spectrum, and if the next harmonic peak search range is the first 3 times harmonic, the calculation formula is as follows:
where P is the index of the position of the previous harmonic peak in the FFT spectrum; when marking the position of the first harmonic peak, P = 1;
if the harmonic is marked 4 th harmonic and later, the calculation formulas of the left endpoint l and the right endpoint r are as follows:
l = P + bT - 2   Formula (10)
r = P + 1.09T_1   Formula (11)
where bT is half the distance between the 3rd harmonic peak and the 1st harmonic peak on the FFT spectrum, and T_1 is the FFT-point spacing between the two harmonics closest to the currently marked harmonic on the FFT spectrum;
s204, all maximum points within the search range are obtained, and the largest of them is taken as the peak point of the current harmonic; if no maximum point exists in the search range, the left and right boundaries of the search range are expanded when marking the first 4 harmonics, and only the right boundary is expanded for the other harmonics; when expanding the search boundary of the first two harmonics, the left and right endpoints l and r satisfy the following conditions:
when expanding the search boundaries of the 3 rd and 4 th harmonics, the left and right endpoints l and r satisfy the following conditions:
l - P ≥ 0.7T_1   Formula (14)
r - P ≤ 1.3T_1   Formula (15)
When the search boundary of the 5 th order and above harmonics is extended, the right end point r satisfies the following condition:
r - P ≤ 1.3T_1   Formula (16)
Then, the largest of all maximum points within the new search range is taken as the peak point of the current harmonic; all FFT logarithmic magnitudes and index values of the maximum points within 0.1 harmonic spacings centered on the maximum point are recorded and stored in the matrix HWB and the sequence HWI respectively, each row of HWB storing a harmonic peak logarithmic value and the nearby spectral logarithmic values; if no maximum point is found, the search range is expanded further, repeating until one is found.
4. The musical instrument classification method based on time-frequency fine analysis according to claim 1, wherein in the step S3, the process of estimating the time-domain single-period fundamental frequency sequence according to the fundamental frequency size of the audio is as follows:
s301, detecting an audio endpoint of a musical instrument by adopting a short-time energy method, and intercepting an effective audio part;
s302, filtering low-frequency interference by using a high-pass filter;
s303, marking the boundary points of single fundamental-wave periods by cross-correlation: taking a point o in the middle of the audio signal as origin, extend to both sides by equal distances, the extension distance being the single-period length obtained in the previous iteration; the signal within this range is the shift vector, whose length is twice the previous single-period length, which provides adaptive adjustment; for the first cross-correlation, the single-period length is the fundamental frequency estimation result of step S2; taking point o as origin, the signal spanning four single fundamental periods to the left or right is the fixed vector, the period length again being that of the previous iteration; the shift vector is slid as a window over the fixed vector in a cross-correlation to obtain the relation between cross-correlation value and relative displacement; the extreme points of the cross-correlation values and the corresponding relative displacements are recorded from this relation; a reasonable value is chosen from the relative displacement sequence, and the next single-period boundary point is computed with o as reference, the distance from this boundary point to o being the single fundamental period length; this cycle is repeated to obtain the time domain single-period fundamental frequency sequence; the cross-correlation r[τ] is calculated as:
where τ denotes the lag, i.e. the translation amount; g[n] and v[n] are two different signals; v[n] is shifted left by τ units, multiplied by g[n], and the products are summed, giving the similarity at the current lag.
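The cross-correlation r[τ] of claim 4 in code form, with v as the longer signal and τ ranging over the shifts for which the overlap is complete:

```python
import numpy as np

def cross_correlation(g, v):
    # r[tau] = sum_n g[n] * v[n + tau], i.e. v shifted left by tau units
    g = np.asarray(g, float)
    v = np.asarray(v, float)
    L = len(g)
    return np.array([np.dot(g, v[t:t + L]) for t in range(len(v) - L + 1)])
```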
5. The musical instrument classification method based on time-frequency fine analysis according to claim 4, wherein in the step S4, the process of extracting the time-frequency fine feature and mel-frequency cepstrum coefficient feature is as follows:
s401, extracting frequency domain fine features, including: the logarithmic value is larger than zero harmonic energy centroid, harmonic peak burr average attenuation ratio, logarithmic value is larger than zero maximum harmonic frequency, slope bandwidth ratio between two harmonics, harmonic peak-valley ratio, harmonic energy drop rate, odd harmonic to even harmonic energy ratio of the first 6 harmonics, first harmonic peak-valley ratio, slope bandwidth ratio drop rate between harmonics, frequency domain logarithmic value is larger than ratio of zero harmonic energy to total energy, harmonic peak-valley difference ratio half harmonic interval, first harmonic peak-valley difference ratio half harmonic interval;
from the HWB obtained in step S2, the logarithmic magnitude sequence P[n_h] of the maximum point of each harmonic peak is obtained; the entries less than or equal to zero in the logarithmic magnitude array are set to zero, with the index starting from 1; the harmonic energy centroid c with logarithmic value greater than zero is calculated by the following formula:
Wherein L is P Is P [ n ] h ]Is also the maximum harmonic order, n h Is the current harmonic frequency, also P [ n ] h ]Index of (2);
the maximum harmonic order with logarithmic value greater than zero is obtained from the logarithmic magnitude array P[n_h] after setting the entries less than or equal to zero to zero, with the index starting from 1;
the calculation of the harmonic-peak burr average attenuation ratio requires, for the part of the FFT logarithmic spectrum of step S2 that is greater than zero, the 20 non-harmonic-peak maxima A[n_p] closest to the harmonic peak maximum within one harmonic spacing centered on it, n_p being the index of A[n_p]; if fewer than 20 maxima lie in this range, all of them are taken; if the number of maxima is less than or equal to 10, the mean of A[n_p] is taken directly as the average burr amplitude around the harmonic peak; if the number of maxima is greater than 10, a run of 10 consecutive maxima is sought in A[n_p] whose FFT index span contains the index of the harmonic peak maximum, or between which and that index there is no other maximum, and whose sum is maximal; the mean of this group is taken as the average burr amplitude around the harmonic peak; with a the average burr amplitude around the harmonic peak, the harmonic peak burr attenuation ratio PDR of a single harmonic is as follows:
where p is the maximum amplitude of the current harmonic peak; averaging the harmonic peak burr attenuation ratios of several harmonics gives the harmonic-peak burr average attenuation ratio;
let the logarithmic spectrum of the FFT spectrum obtained in step S2 be mag[k], and let HWI from step S2, the index sequence of the harmonic peak maxima in mag[k], be abbreviated hi[n_h], with all sequence indices starting from 1; the inter-harmonic slope bandwidth ratio SFWR between the 3rd and 4th harmonics is then calculated as follows:
when calculating the inter-harmonic slope bandwidth ratio decline rate, first calculate the slope bandwidth ratio between the direct-current component and the first harmonic in the logarithmic spectrum as r_1; then calculate the inter-harmonic slope bandwidth ratio for each harmonic spacing between the 5th and 8th harmonics and take the mean as r_2, using the highest harmonic if fewer than 8 harmonics exist; the inter-harmonic slope bandwidth ratio decline rate SFWRD is calculated as follows:
the harmonic peak-to-valley ratio is calculated on the part of the FFT logarithmic spectrum greater than zero: the elements of mag[k] less than or equal to zero are set to zero, giving a new array MAG[k]; a harmonic peak has two troughs, left and right; the left trough value v_p1 is the mean of MAG[k] over the trough range of the left harmonic spacing, and the right trough value v_p2 is the mean of MAG[k] over the trough range of the right harmonic spacing; in the calculation of the M-order harmonic peak-to-valley ratio, v_p1 is calculated as follows:
wherein L is the left boundary of the trough value calculation and R is the right boundary of the trough value calculation;
the calculation formula of v_p2 in the calculation of the M-order harmonic peak-to-valley ratio is as follows:
the logarithmic magnitude sequence of the maximum points of the harmonic peaks is known as P[n_h]; the calculation formula of the Mth harmonic peak-to-valley ratio MR is as follows:
where P[M] represents the logarithmic magnitude of the maximum point of the Mth harmonic peak; the average of the harmonic peak-to-valley ratios over multiple harmonics is taken as the characteristic parameter of the harmonic peak-to-valley ratio;
the first harmonic peak-to-valley ratio is a special case of the Mth harmonic peak-to-valley ratio, and the steps are similar; in the calculation of the first harmonic peak-to-valley ratio, v_p1 is calculated as follows:
in the calculation of the first harmonic peak-to-valley ratio, v_p2 is calculated as follows:
the first harmonic peak-to-valley ratio FHPVR is calculated as follows:
the calculation of the harmonic energy reduction rate is divided into two cases according to whether num, the maximum harmonic order with logarithmic value greater than zero, is greater than four or less than or equal to four; in the first case, let head be the second-largest logarithmic amplitude among the first three harmonics and s its harmonic order, and let tail be the mean of those of the (num-2)th, (num-1)th and num-th harmonics whose logarithmic amplitude is greater than zero; in this case the harmonic energy reduction rate HDR is calculated as:
if num, the maximum harmonic order with logarithmic value greater than zero, is less than or equal to four, head is the maximum logarithmic magnitude among the first num-1 harmonics, s is its harmonic order, and tail is the logarithmic magnitude of the num-th harmonic; in this case the harmonic energy reduction rate HDR is calculated as:
in the calculation of the energy ratio of odd to even harmonics of the first 6 harmonics, let the elements of the matrix HWB be HB_ij; the odd-to-even harmonic energy ratio R_oe6 of the first 6 harmonics is calculated as follows:
where the row and column indices of the matrix start from 1, and wid is the length of a row of HB_ij;
for the ratio of harmonic energy with frequency-domain logarithmic value greater than zero to total energy, the elements of the matrix HWB that are less than or equal to zero are set to zero to obtain a new matrix UPHB_ij; this ratio UHTR is calculated as follows:
where row and col are the numbers of rows and columns of the matrix UPHB_ij, respectively;
the calculation of the harmonic peak-valley difference ratio frequency spacing requires, for the Mth harmonic, the absolute difference between the harmonic trough value v and the harmonic peak maximum logarithmic value P[M]; this difference divided by the frequency spacing between the harmonic trough and peak is the M-order harmonic peak-valley difference ratio frequency spacing, and this characteristic value is averaged over several harmonics to obtain the final value; the harmonics with harmonic peak logarithmic value greater than zero among the 5th to 8th harmonics of the logarithmic spectrum are used; a harmonic peak has two troughs, left and right, with left trough value v_p1 and right trough value v_p2, computed with the same algorithm as in the M-order harmonic peak-to-valley ratio; the Mth harmonic peak-valley difference ratio frequency spacing DFR is calculated as follows:
DFR is calculated for several harmonics and averaged to obtain the harmonic peak-valley difference ratio frequency spacing value;
the first harmonic peak-valley difference ratio frequency spacing FDFR is a special case of the M-order harmonic peak-valley difference ratio frequency spacing; its left and right trough values v_p1 and v_p2 are those from the first harmonic peak-to-valley ratio calculation; the calculation formula is as follows:
the 1 st and 2 nd harmonic energy ratio is the ratio of the maximum logarithmic magnitude of the first harmonic to the maximum logarithmic magnitude of the second harmonic;
s402, extracting time domain fine features, including the single-period fundamental frequency sequence standard deviation, mean, median, mean-median distance, and number of mean crossings per unit time; let the time domain single-period fundamental frequency sequence obtained in step S3 be F[n_T], let its length be Len, and let n_T denote the index of the time domain single period; the single-period fundamental frequency sequence mean f_mean is calculated as follows:
the single-period fundamental frequency sequence standard deviation f_std is calculated as follows:
after sorting F[n_T], the single-period fundamental frequency sequence median f_median is calculated as follows:
the single-period fundamental frequency sequence mean-median distance f_mm is calculated as follows:
f_mm = |f_mean - f_median|   Formula (45)
the number of mean crossings per unit time of the single-period fundamental frequency sequence, N_c, is calculated as follows:
where T_1 is the duration of the time-domain audio signal and u(t) is expressed as follows:
s403, extracting the static features of the 13-dimensional mel-frequency cepstral coefficients: first, the input audio data are preprocessed; second, an FFT is applied to the audio, and the squared FFT magnitudes give the energy spectrum of the audio; the spectrum is then filtered by a mel filter bank to obtain the energy output of each filter; finally, the logarithm of the filter outputs is taken and a discrete cosine transform (DCT) yields the mel-frequency cepstral coefficient features.
6. The method for classifying musical instruments based on time-frequency fine analysis according to claim 1, wherein in step S5, the characteristic correlation and the coefficient of variation are evaluated, and the process of classifying musical instruments by establishing a random forest model is as follows:
s501, calculating the Pearson correlation coefficient among the elements of the feature vector and evaluating the feature effectiveness;
s502, calculating variation coefficients of elements of the feature vector and evaluating feature effectiveness;
s503, fusing the time-frequency fine features and the mel-frequency cepstrum coefficient features into new feature vectors, and sending the fused features into a random forest model for training, wherein five-fold cross verification is adopted during training;
The processing of the random forest model is as follows: the training set is randomly sampled multiple times by the bootstrap method, and a training sample obtained by sampling is denoted θ;
for each training sample θ, a corresponding decision tree is built; in each constructed decision tree, q features are randomly selected from the Q features of the training sample θ for splitting, where q ≤ Q and ⌊·⌋ denotes the floor (round-down) operation; the Gini impurity of the q selected features is calculated, and the feature with the minimum Gini impurity is taken as the optimal splitting feature; according to the selected optimal splitting feature, the node it belongs to is split into two parts, and a suboptimal splitting feature is selected from the remaining features; this is repeated until the number of remaining features of θ is 0 or the classification performance of the current decision tree is optimal;
all 30 decision trees are combined to obtain the random forest model, and the final classification result is obtained by majority voting.
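The full training setup of S503 and the ensemble above (30 bootstrapped trees, ⌊√Q⌋ candidate features per split, Gini criterion, majority voting, five-fold cross-validation) maps directly onto scikit-learn. This is a sketch with randomly generated stand-in data, not the patent's fused feature set; the dataset shape and class count are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the fused feature matrix: 120 clips x 20 features, 4 instruments
rng = np.random.default_rng(0)
X = rng.random((120, 20))
y = rng.integers(0, 4, size=120)

# 30 bootstrapped trees, each split drawing floor(sqrt(Q)) candidate features,
# Gini impurity as the split criterion, final label by majority vote over trees
model = RandomForestClassifier(n_estimators=30, max_features='sqrt',
                               criterion='gini', bootstrap=True, random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # five-fold cross-validation
print(scores.shape)  # (5,)
```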
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310583602.3A CN116682456A (en) | 2023-05-23 | 2023-05-23 | Musical instrument classification method based on time-frequency fine analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116682456A true CN116682456A (en) | 2023-09-01 |
Family
ID=87780198
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310583602.3A Pending CN116682456A (en) | 2023-05-23 | 2023-05-23 | Musical instrument classification method based on time-frequency fine analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116682456A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727313A (en) * | 2024-02-18 | 2024-03-19 | 百鸟数据科技(北京)有限责任公司 | Intelligent noise reduction method for wild bird sound data |
CN117727313B (en) * | 2024-02-18 | 2024-04-23 | 百鸟数据科技(北京)有限责任公司 | Intelligent noise reduction method for wild bird sound data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||