CN112259063B - Multi-pitch estimation method based on note transient dictionary and steady state dictionary - Google Patents


Info

Publication number
CN112259063B
Authority
CN
China
Prior art keywords
note
transient
dictionary
notes
steady
Prior art date
Legal status
Active
Application number
CN202010934594.9A
Other languages
Chinese (zh)
Other versions
CN112259063A (en)
Inventor
韦岗
姜云华
曹燕
王一歌
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010934594.9A priority Critical patent/CN112259063B/en
Publication of CN112259063A publication Critical patent/CN112259063A/en
Application granted granted Critical
Publication of CN112259063B publication Critical patent/CN112259063B/en

Classifications

    • G10H3/12 — Instruments in which the tones are generated by electromechanical means, using mechanical resonant generators, e.g. strings or percussive instruments, the tones of which are picked up by electromechanical transducers
    • G10H1/0041 — Recording/reproducing or transmission of music for electrophonic musical instruments, in coded form
    • G10L25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • Y02D30/70 — Reducing energy consumption in wireless communication networks

Abstract

The invention discloses a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary, comprising a note event detection step and a multi-pitch estimation step. The note event detection step detects the occurrence of note events, extracts the onset and offset times of their notes, and divides the music into a number of note event segments, each containing one or more notes. The multi-pitch estimation step extracts the number of notes in each note event segment and their pitch values. The invention uses octaves to divide the audio into several frequency bands and applies windowing and differencing to the energy-weighted power-scaled spectral flux of each band, so that weaker onsets in the signal are enhanced and stronger onsets are suppressed, achieving more accurate onset detection. Exploiting the time-varying nature of notes, dictionaries are constructed from their transient and steady-state stages, and a sparse-norm-constrained matrix factorization applied to the transient and steady-state stages of the note event segments improves the performance of multi-pitch estimation.

Description

Multi-pitch estimation method based on note transient dictionary and steady state dictionary
Technical Field
The invention relates to the technical fields of note onset detection, multi-pitch estimation and non-negative matrix factorization for piano music, and in particular to a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary.
Background
Music is an important means of expressing emotion and conveying information; it is an expression of human rational and perceptual thinking, and many thoughts and feelings that cannot be accurately described in language can be conveyed through music. Music is therefore an important component of people's communication of ideas and expression of emotion.
With the development of computer networks and multimedia technology, the creation and dissemination of digital music keep advancing. The rapid growth of the digital music industry and the emergence of massive music data make technologies such as music classification, retrieval, recommendation and content analysis increasingly important, and they have become a research hotspot in digital audio processing. In the era of digitization and informatization, the most fundamental means of obtaining music of all kinds is music information retrieval (Music Information Retrieval, MIR) technology. The accuracy of note onset detection and multi-pitch estimation is a cornerstone of this field and plays a very important role in further analyses such as music structure analysis, music retrieval, music style classification and singing recognition.
In struck and percussive instruments, the evolution of a note can generally be divided into two stages, transient and steady-state, and further into four basic stages: attack (Attack), decay (Decay), sustain (Sustain) and release (Release). From the attack stage to the decay stage the amplitude changes relatively quickly with time; this is the transient stage. From the sustain stage to the release stage the amplitude changes smoothly with time; this is the steady-state stage. Note onset detection detects the moment at which a note begins, while multi-pitch estimation extracts the number of notes played at the same time and their pitch values.
To extract the number of notes played at the same time and their pitch values from piano music in wav format, the onset and offset times of notes must first be detected; the interval between onset and offset is taken as a note event segment, each segment containing one or more notes played at the same time. Methods currently used for note onset detection include those based on short-time energy and zero-crossing rate, on phase features, on spectral difference, and on high-frequency weighted component analysis. Judging note onsets by short-time energy and zero-crossing rate has low accuracy; the phase-based method is easily affected by low-frequency noise energy and is unsuitable for polyphonic piano music; the spectral-difference method must process a large number of frequency bins, has a large computational cost, and struggles to detect weaker onsets; and high-frequency weighted component analysis gives large weights to high frequencies, which can make weak low-frequency notes hard to detect.
Second, the note event segments obtained by note event detection must undergo multi-pitch estimation to extract the number of notes and the pitch value of each segment. Conventional multi-pitch estimation methods divide into feature-based methods, statistical-model methods and spectral-decomposition methods. Each has its defects: feature-based methods usually adopt fixed screening rules and cannot adapt to the file under test; statistical-model methods have high computational complexity, a pronounced octave-error phenomenon and unsatisfactory estimates of the number of notes; and conventional spectral-decomposition methods typically factorize short-time Fourier transform spectra and build the dictionary from a single atom per note, i.e. one note is represented by one spectral atom, so the time-varying spectral information during a note's evolution is not fully exploited.
Disclosure of Invention
The invention aims to overcome the defects in the prior art by providing a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary, comprising a note event detection step and a multi-pitch estimation step. The note event detection step uses the octaves of the piano to divide the audio into 6 sub-bands and takes the result of windowing and differencing the energy-weighted power-scaled spectral flux (Power-Scaled Spectral Flux) of each sub-band as the onset detection function; this enhances weaker onsets in the music signal, suppresses stronger ones, and improves the robustness of the detection method, achieving more accurate note onset detection. The multi-pitch estimation step exploits the time-varying nature of notes: it constructs a note transient feature dictionary base matrix and a note steady-state feature dictionary base matrix from the spectral characteristics of the transient and steady-state stages of single-note samples, and applies sparse-norm-constrained non-negative matrix factorization to the spectra of the transient and steady-state stages of the note event segment under test to obtain the coefficient matrices of the corresponding parts, thereby improving the performance of piano multi-pitch estimation.
The aim of the invention can be achieved by adopting the following technical scheme:
a multi-pitch estimation method based on a note transient dictionary and a steady state dictionary comprises the following steps:
a note event detection step: input a piece of piano music in wav format, detect the occurrence of note events with an octave filter bank, extract the onset and offset times of the notes, take the interval between the onset and offset of a note as a note event segment, and divide the piano music into a number of note event segments, each containing one or more notes played at the same moment;
and a multi-pitch estimation step: extract the number of notes and the pitch values of each note event segment and concatenate the note sequence containing the multiple notes in time order.
Further, the octave filter bank in the note event detection step comprises a plurality of band-pass filters, where the number of filters and the cut-off frequency of each filter are determined by the number of piano keys and twelve-tone equal temperament.
Further, since a standard piano is a twelve-tone equal-temperament instrument with 88 keys, the pitch of each key is fixed by equal temperament. At the same time, because the fundamental frequencies of the first three octaves of the piano are closely spaced and a band-pass filter attenuates strongly at the boundary frequencies of these octaves, the first three octaves are merged into a single band. The octave filter bank therefore contains 6 band-pass filters, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones that straddle an octave boundary.
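The cut-off rule above can be sketched numerically. This is a minimal sketch, assuming 1-based key numbering with A4 = key 49 = 440 Hz (standard tuning, not stated in the text) and assuming each cut-off is the mean of the two fundamentals straddling an octave boundary counted upward from A0:

```python
import numpy as np

def piano_key_freq(n):
    """Fundamental of piano key n (1..88), twelve-tone equal temperament, A4 = key 49 = 440 Hz."""
    return 440.0 * 2.0 ** ((n - 49) / 12.0)

def octave_band_edges():
    """Internal cut-off frequencies of the 6-band octave filter bank described above.

    Octave boundaries of the 88-key piano fall every 12 keys starting from A0
    (key 1); each cut-off is the mean of the fundamentals of the two adjacent
    tones straddling a boundary.  The first three octaves (27.5 Hz up to about
    207.65 Hz) are merged into one band, leaving 6 bands in total.
    """
    # boundary between octave group j and j+1 sits between keys 12*j and 12*j + 1
    boundaries = [(piano_key_freq(12 * j) + piano_key_freq(12 * j + 1)) / 2.0
                  for j in range(1, 8)]
    # merging the first three octaves drops the first two internal boundaries,
    # leaving 5 internal cut-offs, i.e. 6 bands
    return boundaries[2:]
```

With these assumptions the lowest cut-off comes out near 213.8 Hz, the mean of G#3 (207.652 Hz, the top of the merged band quoted in the text) and A3 (220 Hz).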
Further, the working process of the note event detection step is as follows:
s101, inputting piano music in a wav format, dividing audio into 6 sub-bands after passing through an octave filter bank, carrying out normalization, framing and windowing, Q-switching (Variable Q Transform, VQT) and power scaling on each sub-band, obtaining a power scaling energy spectrum of each frame, calculating first-order differences of adjacent frames, and adding all lines of a specific frame to obtain a novel Function (Novelty Function) of the sub-band;
s102, calculating root mean square energy of each sub-band in each frame, summing to obtain energy of the sub-band, taking energy specific weight of each sub-band relative to the whole music band as a weight coefficient, and weighting and summing novel functions of each sub-band;
s103, windowing the weighted and summed novel function, summing each frame in a window, performing first-order difference on each frame, calculating the difference value between the weighted and summed novel function and the windowed first-order difference, and taking the result as a note starting point detection function;
s104, carrying out normalization processing on the starting point detection function and detecting a peak value of the starting point detection function, wherein the time corresponding to the peak value is the starting point time of a note, setting a time threshold value, and combining the starting points of notes with adjacent time differences smaller than the threshold value to serve as a note starting point;
s105, setting a threshold value according to the short-time energy of a first frame from a note starting point, judging frame by frame, if a frame with short-time energy smaller than the threshold value is found, regarding the frame as a note ending point, and if the short-time energy of all frames before a second note starting point is larger than the threshold value, regarding the second note starting point as the ending point of the first note;
s106, taking the starting point and the end point of each note as a note event segment, wherein the note event segment comprises one or more notes played at the same moment.
Further, the working process of the multi-pitch estimation step is as follows:
s201, constructing a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix by utilizing spectrum characteristics of a transient stage and a steady stage of a single note sample;
s202, according to the obtained note transient characteristic dictionary base matrix and note steady-state characteristic dictionary base matrix, normalizing frequency spectrum application based on transient and steady-state phases of a note event segment to be identified
Figure BDA0002671490110000041
The non-negative matrix factorization method of the norm sparsity constraint performs multi-note result estimation, i.e., detects the number of notes present in a note event segment and their pitch values.
Further, the process of step S201 is as follows:
s2011, respectively performing Q-switching conversion on transient and steady states of 88 piano single-symbol sample signals to obtain respective frequency spectrum matrixes, and performing normalization processing;
s2012, respectively carrying out non-negative matrix factorization estimation on the normalized frequency spectrum matrixes in the transient stage and the steady state stage to obtain spectrum base atoms of the note sample, thereby obtaining a frequency spectrum feature matrix of the note in the corresponding stage;
s2013, carrying out normalization processing on each spectrum base atom of each note feature matrix;
s2014, sequencing the notes according to the order of the pitches of 88 notes of the piano from small to large, and respectively splicing to obtain a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix in a note transient stage and a note steady state stage.
Further, the process of step S202 is as follows:
s2021, respectively performing Q-switching transformation on a transient stage and a steady-state stage of the note event segment obtained in the note event detection step to obtain respective frequency spectrum matrixes, and performing normalization processing;
s2022, carrying out normalized transient stage and steady stage frequency spectrum matrix by adopting note transient characteristic dictionary base matrix and note steady characteristic dictionary base matrix
Figure BDA0002671490110000051
Non-negative matrix factorization estimation of norm sparse constraint is carried out to obtain coefficient matrixes of corresponding stages, and the coefficient matrixes of the note event segments are obtained by adding;
s2023, performing threshold post-processing on the obtained coefficient matrix, and judging that a single note exists in the note event segment when the coefficient of the single note exceeds a set threshold value;
s2024, splicing the note sequences containing the multi-notes according to the time sequence, so as to obtain a multi-pitch estimation result.
Compared with the prior art, the invention has the following advantages and effects:
1) In the note event detection step, the occurrence of note events is detected and the onset and offset times of notes are extracted. Considering that some low-frequency note onsets are very soft, band splitting, energy weighting and power scaling are used to enhance weaker onsets relative to stronger ones and to strengthen the robustness of the detection method. In addition, the windowing and differencing processing increases the number of correctly detected note onsets and reduces false alarms, improving the overall detection performance.
2) In the multi-pitch estimation step, multi-pitch estimation is based on decomposing the note event segments, which takes into account the striking moment, pitch value and ending moment of notes during piano playing and avoids spurious-playing errors. Considering the continuity of music and the time-varying nature of notes, a note is divided into two main stages, transient and steady-state, and the transient and steady-state stages of note event segments are processed separately; using a multi-atom note feature dictionary and introducing a sparse-norm constraint into the non-negative matrix factorization achieves higher estimation accuracy while keeping the procedure fast.
Drawings
FIG. 1 is a flow chart of the multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary disclosed in the present invention;
FIG. 2 is a frequency response diagram of the octave filter bank in an embodiment of the invention;
FIG. 3 is a schematic diagram of the stages of a note in an embodiment of the invention;
FIG. 4 is a diagram of the note event division of a portion of audio in an embodiment of the present invention;
FIG. 5 is a block diagram of the note event detection step in an embodiment of the present invention;
FIG. 6 is a block diagram of the construction of the note feature dictionary base matrices in the multi-pitch estimation step in an embodiment of the present invention;
FIG. 7 is a block diagram of the spectral decomposition of a note event segment in the multi-pitch estimation step in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
FIG. 1 shows the flowchart of the multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary in this embodiment. The specific steps are as follows:
a note event detection step: input a piece of piano music in wav format, detect the occurrence of note events with an octave filter bank, extract the onset and offset times of the notes, and take the interval between the onset and offset of a note as a note event segment (a note event segment is formed between a note's onset and offset, as shown in FIG. 4), dividing the piano music into a number of note event segments, each containing one or more notes played at the same moment;
and a multi-pitch estimation step: extract the number of notes and the pitch values of each note event segment and concatenate the note sequence containing the multiple notes in time order.
FIG. 2 shows the frequency response of the octave filter bank in an embodiment of the invention. It comprises a plurality of band-pass filters, where the number of filters and the cut-off frequency of each filter are determined by the number of piano keys and twelve-tone equal temperament. A standard piano is a twelve-tone equal-temperament instrument with 88 keys, the pitch of each key being fixed by equal temperament. Considering that the fundamental frequencies of the first three octaves of the piano lie between 27.5 Hz and 207.652 Hz, and that a band-pass filter attenuates strongly at the boundary frequencies of these octaves, the first three octaves are merged into one band. The filter bank therefore contains 6 band-pass filters, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones that straddle an octave boundary.
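As a numpy-only illustration of the six-band split (the patent uses band-pass filters; here, as a simplification, each band is cut out in the frequency domain instead, which keeps the example self-contained):

```python
import numpy as np

def split_subbands_fft(x, cutoffs, fs=44100):
    """Illustrative frequency-domain split of mono signal x into 6 sub-bands.

    cutoffs : the 5 internal cut-off frequencies of the 6-band bank.
    Each band is obtained by zeroing rFFT bins outside its range; the upper
    edge is set to fs so that the mask `freqs < hi` also covers the Nyquist
    bin, making the bands sum back to the original signal.
    """
    edges = [0.0] + list(cutoffs) + [float(fs)]
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spec * mask, n=len(x)))
    return bands
```

Because the band masks are disjoint and cover every bin, the sub-band signals sum back to the input, which is a convenient sanity check on the split.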
FIG. 3 is a schematic diagram of the stages of a note in an embodiment of the present invention, represented by the amplitude change from the beginning to the end of the note. The evolution of a note is generally divided into transient and steady-state stages, and further into four basic stages: attack, decay, sustain and release. From the attack stage to the decay stage the amplitude changes rapidly with time; this is the transient stage. From the sustain stage to the release stage the amplitude changes smoothly with time; this is the steady-state stage. The starting point of the transient stage in the figure is the onset of the note, and the object of the note event detection step is to detect this onset.
FIG. 5 is a block diagram of the note event detection step of the present invention, which specifically comprises the following steps:
s101, inputting piano music in a wav format, and dividing the audio into 6 sub-bands after passing through an octave filter bank;
s102, carrying out pretreatment such as normalization, framing and windowing on each sub-band, wherein the parameter setting considers that the sampling rate of the music in wav format is generally 44.1kHz, and a Hamming window is used for signals in order to avoid spectrum leakage, so that 2048 sampling points are taken for the window length, 512 sampling points are taken for the frame shift, the time difference of adjacent frames is about 11.6 milliseconds, namely the error time of a predicted value and an actual result is at most 11.6 milliseconds;
s103, performing Q-switching transformation on the preprocessed signals to obtain corresponding frequency spectrums X (K, N), and performing normalization processing, wherein K represents the frequency number of each frame, and N represents the frame number. In the conventional time-frequency transformation, the short-time Fourier transformation adopts a fixed length window, so that the problem of frequency resolution is easy to generate, meanwhile, the frequency points of the short-time Fourier transformation are linearly distributed and cannot be in one-to-one correspondence with the frequencies of piano notes in exponential distribution, and the estimation error of certain note frequencies is larger; the frequency bin distribution of the constant Q transform is not linear but exponential, but the time resolution is relatively low due to the large number of sampling points chosen at low frequencies. Therefore, a variable Q transform is employed that also has a higher temporal resolution at low frequencies, whose k-th frequency component has a bandwidth as follows:
Figure BDA0002671490110000081
δ k =α·f k +gamma formula (2)
Wherein f k Is the center frequency of frequency band k; delta k Is the bandwidth of the kth frequency component; f (f) min Analyzing the lowest frequency in the frequency interval of the music signal, and taking 27.5Hz; b is the number of frequency points in each octave, which is usually a multiple of 12, and 48 is taken in order to obtain richer spectral feature information; k is the number of frequency bands divided in the variable Q transform spectrum;
Figure BDA0002671490110000082
is a constant value and is only related to the value of b; gamma is the compensation parameter, when gamma>0, increasing the time resolution at low frequencies;
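The center-frequency/bandwidth relations (1)–(2) can be checked numerically; the function below assumes the reconstructed forms given above, with α = 2^(1/b) − 2^(−1/b):

```python
import numpy as np

def vqt_bins(f_min=27.5, b=48, K=480, gamma=0.0):
    """Center frequencies f_k and bandwidths delta_k of the variable-Q transform,
    per formulas (1)-(2) as reconstructed above."""
    k = np.arange(1, K + 1)
    f = f_min * 2.0 ** ((k - 1) / b)            # formula (1): geometric bin spacing
    alpha = 2.0 ** (1.0 / b) - 2.0 ** (-1.0 / b)
    bw = alpha * f + gamma                      # formula (2): bandwidth with offset gamma
    return f, bw
```

With γ = 0 the ratio f_k / δ_k is the same for every bin, i.e. the transform reduces to a constant-Q transform; γ > 0 widens the low-frequency bins, which is exactly the claimed time-resolution gain at low frequencies.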
s104, carrying out power scaling on each normalized frame spectrum, calculating the first-order difference of adjacent frames, adding all rows of a specific frame to obtain a novel function of the sub-band, and adopting the following formula:
SF(k,n)=|X(k,n+1)| p -|X(k,n)| p formula (3)
Figure BDA0002671490110000091
Wherein SF (k, n) is expressed as the spectral flux of adjacent power scaling energy spectra; x (k, n) represents a variable Q transform spectrum of the input signal; NF (NF) i (n) a novel function represented as the subband; |x| p Represents the power of p of the element in x; h (x) = (x+|x|)/2 is represented as a half-wave rectification function; k represents the frequency number of each frame; p represents a power scaling factor, typically taken as 0.5;
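Formulas (3)–(4) can be sketched directly in numpy:

```python
import numpy as np

def subband_novelty(X, p=0.5):
    """Power-scaled spectral-flux novelty function of one sub-band, formulas (3)-(4).

    X : magnitude VQT spectrogram of the sub-band, shape (K, N).
    Returns NF_i(n) for the N-1 frame transitions.
    """
    Xp = np.abs(X) ** p                # power scaling, p = 0.5
    SF = Xp[:, 1:] - Xp[:, :-1]        # first-order difference of adjacent frames (3)
    H = (SF + np.abs(SF)) / 2.0        # half-wave rectification keeps only energy increases
    return H.sum(axis=0)               # sum over the K frequencies of each frame (4)
```

The half-wave rectification means only energy rises contribute, so a sudden attack produces a peak while the following decay contributes nothing.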
s105, calculating root mean square energy of each frame of each sub-band and summing to obtain energy of the sub-band, taking energy specific weight of each sub-band relative to the whole music band as a weight coefficient, weighting and summing novel functions of each sub-band, and adopting the following formula:
Figure BDA0002671490110000092
Figure BDA0002671490110000093
wherein omega i Is the energy specific gravity of the ith sub-band relative to the whole music band, i.e. the weight coefficient; NF (NF) i (n) is a novel function of the ith subband; RMSE i Is the root mean square energy of the ith sub-band; NF (n) is a novel function after weighted summation;
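A minimal sketch of formulas (5)–(6); for simplicity the sub-band energy is computed here as the RMS of the whole sub-band signal rather than as a sum of per-frame RMS values, which is an assumed simplification:

```python
import numpy as np

def weighted_novelty(subband_signals, novelty_fns):
    """Energy-weighted sum of sub-band novelty functions, formulas (5)-(6).

    subband_signals : list of the 6 sub-band time signals.
    novelty_fns     : list of the 6 matching novelty functions, equal length N.
    """
    rmse = np.array([np.sqrt(np.mean(s ** 2)) for s in subband_signals])
    w = rmse / rmse.sum()              # formula (5): energy share of each sub-band
    NF = np.stack(novelty_fns)         # shape (6, N)
    return w @ NF                      # formula (6): weighted sum over sub-bands
```

A silent sub-band thus receives weight 0 and cannot inject noise into the combined detection function.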
s106, windowing the weighted and summed novel function NF (n), carrying out first-order difference on each frame in the summing window, then calculating the difference value between the weighted and summed novel function and the windowed first-order difference, taking the result as a note starting point detection function, and adopting the following formula:
Figure BDA0002671490110000094
ODF (n) =nf (n) - { w (n) -w (n-1) } formula (8)
Wherein W (n) is the sum of the length W frame windows after frame n for the novel function; ODF (n) is a note origin detection function;
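Formulas (7)–(8) as a sketch, assuming W(n) sums the w frames after frame n (the text only says "the sum of the length-w frame window after frame n", so the exact window placement is an assumption):

```python
import numpy as np

def onset_detection_function(NF, w=3):
    """Note onset detection function, formulas (7)-(8).

    NF : weighted novelty function, 1-D array.
    w  : window length in frames.
    """
    N = len(NF)
    # formula (7): running sum over the w frames after frame n (slices clip at the end)
    W = np.array([NF[n + 1:n + 1 + w].sum() for n in range(N)])
    ODF = NF.copy()
    # formula (8): subtract the first-order difference of the windowed sums
    ODF[1:] = NF[1:] - (W[1:] - W[:-1])
    return ODF
```

The subtracted difference acts as a local trend term, so a frame is emphasised only when it stands out against the novelty in its neighbourhood.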
s107, carrying out normalization processing on the starting point detection function and detecting a peak value of the starting point detection function, wherein the time corresponding to the peak value is the starting point time of a note, setting a time threshold value, and combining the starting points of notes with adjacent time differences smaller than the threshold value to serve as a note starting point, wherein the time threshold value is usually 50 milliseconds;
s108, setting a threshold according to the short-time energy of a first frame from a note starting point, judging from frame to frame, if a frame with short-time energy smaller than the threshold is found, then regarding the frame as a note end point, and if the short-time energy of all frames before a second note starting point is larger than the threshold, regarding the second note starting point as the end point of the first note, wherein the threshold is usually one tenth of the short-time energy of the first frame;
s109, taking the starting point and the end point of each note as a note event segment, wherein the note event segment comprises one or more notes played at the same moment.
FIG. 6 is a block diagram of the construction of the note feature dictionary base matrices in the multi-pitch estimation step in an embodiment of the present invention; the specific process is as follows:
s2011, respectively performing Q-switching conversion on transient phases and steady phases of 88 piano single symbol sample signals X (n) to obtain respective frequency spectrum matrixes X m×n And performing normalization processing, wherein m=1, …, M represents a frequency number, n=1, …, N represents a frame number;
s2012, respectively carrying out frequency spectrum matrix X on the transient stage and the steady-state stage after normalization m×n And carrying out nonnegative matrix factorization to obtain spectrum base atoms of the corresponding stage of the note sample. Assume that the spectrum matrix of the corresponding stage of the kth note sample is expressed as
Figure BDA0002671490110000101
For matrix->
Figure BDA0002671490110000102
Performing non-negative matrix factorization, and taking the rank as r:
Figure BDA0002671490110000103
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0002671490110000104
a spectral feature matrix representing a corresponding stage of the note sample; />
Figure BDA0002671490110000105
A coefficient matrix corresponding to the spectral feature matrix representing the corresponding stage of the note sample; rank r represents the number of atoms in constructing a dictionary. The decomposition operation adopts a cost function based on beta divergence, as shown in a formula (10); to get->
Figure BDA0002671490110000106
Firstly, random non-negative initialization is adopted, and then, the matrix is iteratively updated by using a formula (11) and a formula (12)>
Figure BDA0002671490110000107
And->
Figure BDA0002671490110000108
Each iteration causes a cost function C β The value of (X|Z) decreases until it converges to stop the iteration;
Figure BDA0002671490110000111
Figure BDA0002671490110000112
Figure BDA0002671490110000113
wherein z=wh; operator ° represents the product of two matrix corresponding elements; operator ≡ represents assigning left object with its right object; h T Representing a transpose of matrix H; z is Z a Representing the a-th power of the elements in the matrix Z; beta takes a value of 0.5;
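A minimal sketch of β-divergence NMF with the multiplicative updates of formulas (11) and (12). The function name `nmf_beta` is illustrative, a fixed iteration count stands in for the convergence test, and numpy is assumed.

```python
import numpy as np

def nmf_beta(X, r, beta=0.5, n_iter=300, seed=0):
    """Minimal beta-divergence NMF sketch: X ~ W @ H, with the
    multiplicative updates of formulas (11) and (12), which keep
    W and H non-negative throughout."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r)) + 1e-6      # random non-negative init
    H = rng.random((r, N)) + 1e-6
    eps = 1e-12
    for _ in range(n_iter):
        Z = W @ H + eps
        H *= (W.T @ (Z ** (beta - 2) * X)) / (W.T @ Z ** (beta - 1) + eps)
        Z = W @ H + eps
        W *= ((Z ** (beta - 2) * X) @ H.T) / (Z ** (beta - 1) @ H.T + eps)
    return W, H
```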
s2013, for each note, each spectral basis atom w_i (the i-th column of the spectral feature matrix W^(k) of the corresponding phase) is normalized by its l2 norm, as follows:

w_i ← w_i / ‖w_i‖_2,   i = 1, …, r
s2014, sorting the 88 notes of the piano by pitch in ascending order and concatenating their feature matrices to obtain the note transient feature dictionary basis matrix W_t for the note transient phase and the note steady-state feature dictionary basis matrix W_s for the steady-state phase, as follows:

W_t = [W_t^(1), W_t^(2), …, W_t^(88)]

W_s = [W_s^(1), W_s^(2), …, W_s^(88)]

where W_t^(k) is the note transient spectral feature matrix of the transient phase of the k-th note sample, and W_s^(k) is the note steady-state spectral feature matrix of the steady-state phase of the k-th note sample.
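Steps S2013-S2014 (atom normalization and dictionary concatenation) can be sketched as below. The name `build_dictionary` and its fallback behaviour are hypothetical; the `nmf` argument is a placeholder for the β-NMF of step S2012, and numpy is assumed.

```python
import numpy as np

def build_dictionary(note_spectra, r=1, nmf=None):
    """Sketch of S2013-S2014: factor each note's spectrum matrix,
    l2-normalize every spectral atom, and concatenate the per-note
    feature matrices (assumed pitch-ascending) into one dictionary
    basis W of shape M x (len(note_spectra) * r).

    note_spectra : list of non-negative (M x N_k) spectrum matrices
    nmf          : routine returning (W_k, H_k); placeholder for the
                   beta-NMF of S2012
    """
    atoms = []
    for X in note_spectra:
        if nmf is not None:
            W_k, _ = nmf(X, r)
        else:                       # fallback: mean spectrum as a single atom
            W_k = X.mean(axis=1, keepdims=True)
        # l2-normalize each column (spectral basis atom)
        W_k = W_k / (np.linalg.norm(W_k, axis=0, keepdims=True) + 1e-12)
        atoms.append(W_k)
    return np.hstack(atoms)
```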
Fig. 7 is a block diagram of the note event segment spectrum decomposition in the multi-pitch estimation step according to an embodiment of the present invention; the specific process is as follows:
s2021, applying the constant-Q transform to the transient phase and the steady-state phase of the note event segment obtained in the note event detection step to obtain the spectrum matrix X_t of the transient phase and the spectrum matrix X_s of the steady-state phase, and normalizing them, where m = 1, …, M indexes the frequency bins and n = 1, …, N indexes the frames;
s2022, decomposing the normalized transient-phase spectrum matrix X_t and steady-state-phase spectrum matrix X_s against the prepared note transient feature dictionary basis matrix W_t and note steady-state feature dictionary basis matrix W_s, respectively, by non-negative matrix factorization with an l_{p,q}-norm sparsity constraint, to estimate the coefficient matrix H_t of the note transient phase and the coefficient matrix H_s of the note steady-state phase, which are added to obtain the coefficient matrix H of the note event segment. The formulas are as follows:

H_t = argmin_H C_β(X_t | W_t H) + λ_1 Ψ_{p,q}(H)

H_s = argmin_H C_β(X_s | W_s H) + λ_2 Ψ_{p,q}(H)

H = H_t + H_s

with the dictionaries held fixed, the coefficient matrices are updated iteratively:

H_t ← H_t ∘ [W_t^T ((W_t H_t)^{β−2} ∘ X_t)] / [W_t^T (W_t H_t)^{β−1} + λ_1 ∇Ψ_{p,q}(H_t)]

H_s ← H_s ∘ [W_s^T ((W_s H_s)^{β−2} ∘ X_s)] / [W_s^T (W_s H_s)^{β−1} + λ_2 ∇Ψ_{p,q}(H_s)]

where the operator ∘ denotes the element-wise product of two matrices; the operator ← assigns the object on its right to the object on its left; argmin_x f(x) denotes the value of the variable x that minimizes the objective function f; H_t is the coefficient matrix of the transient phase; H_s is the coefficient matrix of the steady-state phase; W_t is the note transient feature dictionary basis matrix; W_s is the note steady-state feature dictionary basis matrix; λ_1 and λ_2 control the strength of the sparsity constraint; Ψ_{p,q}(H_t) and Ψ_{p,q}(H_s) are the added l_{p,q}-norm constraint terms, with parameters p = 2 and q = 0.5; ∇Ψ_{p,q}(H_t) and ∇Ψ_{p,q}(H_s) are the gradients of the l_{p,q}-norm constraint terms with respect to the corresponding coefficient matrices; β is set to 0.5;
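A sketch of the fixed-dictionary sparse decomposition of step S2022. For simplicity it substitutes a plain l1 penalty, whose gradient is the constant λ, for the patent's l_{p,q} mixed-norm term (p = 2, q = 0.5); the name `sparse_decompose` is illustrative and numpy is assumed.

```python
import numpy as np

def sparse_decompose(X, W, beta=0.5, lam=0.1, n_iter=100, seed=0):
    """Non-negative decomposition of a note-event spectrum X against a
    FIXED dictionary W, with a sparsity penalty on the activations H.
    Only H is updated; W stays fixed.  An l1 penalty (constant
    gradient lam in the denominator) stands in for the l_{p,q} term."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + 1e-6
    eps = 1e-12
    for _ in range(n_iter):
        Z = W @ H + eps
        H *= (W.T @ (Z ** (beta - 2) * X)) / (W.T @ Z ** (beta - 1) + lam + eps)
    return H
```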
s2023, sorting the entries of the obtained coefficient matrix H in descending order, taking the first M coefficients as the coefficients of the candidate fundamental frequencies, and then computing the threshold λ·ΔS: a single note is judged to be present in the note event segment when its coefficient exceeds the set threshold, where λ is a constant in the range 0 < λ < 1, and ΔS is the difference between the largest coefficient and the M-th largest coefficient, i.e., ΔS = S(1) − S(M);
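The threshold post-processing of step S2023 can be sketched as follows; the function name `pick_notes` and the example values are illustrative, and numpy is assumed.

```python
import numpy as np

def pick_notes(coeffs, M=5, lam=0.5):
    """Sketch of S2023: sort the per-note activation strengths in
    descending order, keep the top M as candidate fundamentals, and
    accept every candidate whose coefficient exceeds
    lam * (S(1) - S(M))."""
    s = np.sort(coeffs)[::-1]        # coefficients, descending
    delta = s[0] - s[M - 1]          # delta_S = S(1) - S(M)
    thresh = lam * delta
    cand = np.argsort(coeffs)[::-1][:M]
    return sorted(int(k) for k in cand if coeffs[k] > thresh)
```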
s2024, splicing the multi-note results of the note event segments in chronological order to obtain the multi-pitch estimation result.
The present invention can be implemented as described above and achieves the aforementioned technical advantages.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. A multi-pitch estimation method based on a note transient dictionary and a steady state dictionary is characterized by comprising the following steps:
a note event detection step of inputting a piece of piano music in wav format, detecting the occurrence of note events with an octave filter bank, extracting the onset time and end time of each note, taking the signal between a note's onset and end point as a note event segment, and thereby dividing the piano music in wav format into a number of note event segments, each note event segment containing one or more notes played at the same moment;
a multi-pitch estimation step of extracting the number of notes and the pitch value of each note in each note event segment, and splicing the results in chronological order to obtain a note sequence containing multiple notes, the working process being as follows:
s201, constructing a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix by utilizing spectrum characteristics of a transient stage and a steady stage of a single note sample;
s202, according to the obtained note transient feature dictionary basis matrix and note steady-state feature dictionary basis matrix, applying non-negative matrix factorization with an l_{p,q}-norm sparsity constraint to the normalized spectra of the transient and steady-state phases of the note event segment to be identified, and estimating the multi-note result, i.e., the number of notes present in the note event segment and their pitch values.
2. The multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary according to claim 1, wherein the octave filter bank comprises a plurality of band-pass filters, the number of band-pass filters and the cut-off frequency of each band-pass filter being determined by the number of piano keys and twelve-tone equal temperament.
3. The multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary according to claim 1, wherein the number of band-pass filters in the octave filter bank is 6, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones at the corresponding octave boundary.
4. The multi-pitch estimation method based on a transient dictionary and a steady state dictionary of notes according to claim 2, wherein the note event detection step works as follows:
s101, inputting piano music in wav format, dividing the audio into 6 sub-bands with the octave filter bank, and for each sub-band performing normalization, framing and windowing, the constant-Q transform and power scaling to obtain a power-scaled energy spectrum for each frame; computing the first-order difference between adjacent frames and summing over all frequency bins of each frame to obtain the novelty function of the sub-band;
s102, computing the root-mean-square energy of each sub-band in each frame and summing over frames to obtain the root-mean-square energy of the sub-band; taking the energy proportion of each sub-band relative to the whole piece as its weight coefficient, and forming the weighted sum of the sub-band novelty functions;
s103, windowing the weighted novelty function, summing the frames within each window and taking the first-order difference, then computing the difference between the weighted novelty function and this windowed first-order difference; the result is taken as the note onset detection function;
s104, normalizing the onset detection function and detecting its peaks, where the time corresponding to each peak is the onset time of a note; setting a time threshold and merging note onsets whose adjacent time difference is smaller than the threshold into a single note onset;
s105, setting a threshold according to the short-time energy of the first frame after a note onset and judging frame by frame: if a frame whose short-time energy is smaller than the threshold is found, that frame is taken as the note end point; if the short-time energy of every frame before the next note onset is larger than the threshold, the next note onset is taken as the end point of the current note;
s106, taking the signal between the onset and end point of each note as a note event segment, where a note event segment contains one or more notes played at the same moment.
5. The multi-pitch estimation method based on the transient dictionary and the steady state dictionary of notes according to claim 1, wherein the process of step S201 is as follows:
s2011, applying the constant-Q transform to the transient phase and the steady-state phase of each of the 88 piano single-note sample signals to obtain the respective spectrum matrices, and normalizing them;
s2012, respectively carrying out nonnegative matrix factorization on the normalized frequency spectrum matrixes in the transient stage and the steady state stage to obtain spectrum base atoms of the note sample, thereby obtaining a frequency spectrum characteristic matrix in the corresponding stage of the note;
s2013, carrying out normalization processing on each spectrum base atom of each note spectrum characteristic matrix;
s2014, sequencing the notes according to the order of the pitches of 88 notes of the piano from small to large, and respectively splicing to obtain a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix in a note transient stage and a note steady state stage.
6. The multi-pitch estimation method according to claim 1, wherein the procedure of step S202 is as follows:
s2021, applying the constant-Q transform to the transient phase and the steady-state phase of the note event segment obtained in the note event detection step to obtain the respective spectrum matrices, and normalizing them;
s2022, decomposing the normalized transient-phase and steady-state-phase spectrum matrices against the note transient feature dictionary basis matrix and the note steady-state feature dictionary basis matrix, respectively, by non-negative matrix factorization with an l_{p,q}-norm sparsity constraint, to obtain the coefficient matrices of the corresponding phases, which are added to obtain the coefficient matrix of the note event segment;
s2023, performing threshold post-processing on the obtained coefficient matrix, and judging that a single note exists in the note event segment when the coefficient of the single note exceeds a set threshold value;
s2024, splicing the note sequences containing the multi-notes according to the time sequence, so as to obtain a multi-pitch estimation result.
CN202010934594.9A 2020-09-08 2020-09-08 Multi-pitch estimation method based on note transient dictionary and steady state dictionary Active CN112259063B (en)

Publications (2)

Publication Number  Publication Date
CN112259063A (en)   2021-01-22
CN112259063B (en)   2023-06-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant