CN112259063B - Multi-pitch estimation method based on note transient dictionary and steady state dictionary - Google Patents


Info

Publication number
CN112259063B
Authority
CN
China
Prior art keywords
note
transient
dictionary
notes
steady
Prior art date
Legal status
Active
Application number
CN202010934594.9A
Other languages
Chinese (zh)
Other versions
CN112259063A (en)
Inventor
韦岗
姜云华
曹燕
王一歌
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010934594.9A priority Critical patent/CN112259063B/en
Publication of CN112259063A publication Critical patent/CN112259063A/en
Application granted granted Critical
Publication of CN112259063B publication Critical patent/CN112259063B/en

Classifications

    • G10H3/12 — Instruments in which the tones are generated by electromechanical means, using mechanical resonant generators, e.g. strings or percussive instruments, the tones of which are picked up by electromechanical transducers
    • G10H1/0041 — Recording/reproducing or transmission of music for electrophonic musical instruments, in coded form
    • G10L25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • Y02D30/70 — Reducing energy consumption in wireless communication networks

Abstract

The invention discloses a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary, comprising a note event detection step and a multi-pitch estimation step. The note event detection step detects the occurrence of note events, extracts the onset and offset times of their notes, and divides the music into a number of note event segments, each containing one or more notes. The multi-pitch estimation step extracts the number of notes in each note event segment and their pitch values. The invention uses octaves to divide the audio into several frequency bands and applies windowing and differencing to the energy-weighted power-scaled spectral flux of each band, so that weaker onsets in the signal are enhanced and stronger onsets are suppressed, achieving more accurate onset detection. Exploiting the time-varying nature of notes, dictionaries are constructed from their transient and steady-state stages, and a sparse-norm-constrained matrix factorization applied to the transient and steady-state stages of the note event segments improves the performance of multi-pitch estimation.

Description

Multi-pitch estimation method based on note transient dictionary and steady state dictionary
Technical Field
The invention relates to the technical fields of note onset detection, multi-pitch estimation and non-negative matrix factorization for piano music, and in particular to a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary.
Background
Music is an important means of expressing emotion and conveying information; it is an expression of human rational and perceptual thinking, and many thoughts and feelings that cannot be accurately described in language can be conveyed through music. Music is therefore an important component of people's communication of ideas and expression of emotion.
With the development of computer networks and multimedia technology, the creation and dissemination of digital music keep advancing. The rapid growth of the digital music industry and the emergence of massive music data make technologies such as music classification, retrieval, recommendation and content analysis increasingly important, and they have become a research hotspot in digital audio processing. In the era of digitization and informatization, the most fundamental means of obtaining music of all kinds is music information retrieval (Music Information Retrieval, MIR) technology. The accuracy of note onset detection and multi-pitch estimation is a cornerstone of this field and plays a very important role in further analyses such as music structure analysis, music retrieval, music style classification and singing recognition.
In struck and percussive instruments, the evolution of a note can generally be divided into two stages, transient and steady-state, and further into four basic stages: attack (Attack), decay (Decay), sustain (Sustain) and release (Release). From the attack stage to the decay stage the amplitude changes relatively quickly with time; this is the transient stage. From the sustain stage to the release stage the amplitude changes smoothly with time; this is the steady-state stage. Note onset detection detects the moment at which a note begins, while multi-pitch estimation extracts the number of notes played at the same time and their pitch values.
To extract the number of notes played at the same time and their pitch values from piano music in wav format, the onset and offset times of notes must first be detected; the interval between onset and offset is taken as a note event segment, each segment containing one or more notes played at the same time. Methods currently used for note onset detection include those based on short-time energy and zero-crossing rate, on phase features, on spectral difference, and on high-frequency weighted component analysis. Judging note onsets by short-time energy and zero-crossing rate has low accuracy; the phase-based method is easily affected by low-frequency noise energy and is unsuitable for polyphonic piano music; the spectral-difference method must process a large number of frequency bins, has a large computational cost, and struggles to detect weaker onsets; and high-frequency weighted component analysis gives large weights to high frequencies, which can make weak low-frequency notes hard to detect.
Second, the note event segments obtained by note event detection must undergo multi-pitch estimation to extract the number of notes and the pitch value of each segment. Conventional multi-pitch estimation methods divide into feature-based methods, statistical-model methods and spectral-decomposition methods. Each has its defects: feature-based methods usually adopt fixed screening rules and cannot adapt to the file under test; statistical-model methods have high computational complexity, a pronounced octave-error phenomenon and unsatisfactory estimates of the number of notes; and conventional spectral-decomposition methods typically factorize short-time Fourier transform spectra and build the dictionary from a single atom per note, i.e. one note is represented by one spectral atom, so the time-varying spectral information during a note's evolution is not fully exploited.
Disclosure of Invention
The invention aims to overcome the defects in the prior art by providing a multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary, comprising a note event detection step and a multi-pitch estimation step. The note event detection step uses the octaves of the piano to divide the audio into 6 sub-bands and takes the result of windowing and differencing the energy-weighted power-scaled spectral flux (Power-Scaled Spectral Flux) of each sub-band as the onset detection function; this enhances weaker onsets in the music signal, suppresses stronger ones, and improves the robustness of the detection method, achieving more accurate note onset detection. The multi-pitch estimation step exploits the time-varying nature of notes: it constructs a note transient feature dictionary base matrix and a note steady-state feature dictionary base matrix from the spectral characteristics of the transient and steady-state stages of single-note samples, and applies sparse-norm-constrained non-negative matrix factorization to the spectra of the transient and steady-state stages of the note event segment under test to obtain the coefficient matrices of the corresponding parts, thereby improving the performance of piano multi-pitch estimation.
The aim of the invention can be achieved by adopting the following technical scheme:
a multi-pitch estimation method based on a note transient dictionary and a steady state dictionary comprises the following steps:
a note event detection step: input a piece of piano music in wav format, detect the occurrence of note events with an octave filter bank, extract the onset and offset times of the notes, take the interval between the onset and offset of a note as a note event segment, and divide the piano music into a number of note event segments, each containing one or more notes played at the same moment;
and a multi-pitch estimation step: extract the number of notes and the pitch values of each note event segment and concatenate the note sequence containing the multiple notes in time order.
Further, the octave filter bank in the note event detection step comprises a plurality of band-pass filters, where the number of filters and the cut-off frequency of each filter are determined by the number of piano keys and twelve-tone equal temperament.
Further, since a standard piano is a twelve-tone equal-temperament instrument with 88 keys, the pitch of each key is fixed by equal temperament. At the same time, because the fundamental frequencies of the first three octaves of the piano are closely spaced and a band-pass filter attenuates strongly at the boundary frequencies of these octaves, the first three octaves are merged into a single band. The octave filter bank therefore contains 6 band-pass filters, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones that straddle an octave boundary.
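The cut-off rule above can be sketched numerically. This is a minimal sketch, assuming 1-based key numbering with A4 = key 49 = 440 Hz (standard tuning, not stated in the text) and assuming each cut-off is the mean of the two fundamentals straddling an octave boundary counted upward from A0:

```python
import numpy as np

def piano_key_freq(n):
    """Fundamental of piano key n (1..88), twelve-tone equal temperament, A4 = key 49 = 440 Hz."""
    return 440.0 * 2.0 ** ((n - 49) / 12.0)

def octave_band_edges():
    """Internal cut-off frequencies of the 6-band octave filter bank described above.

    Octave boundaries of the 88-key piano fall every 12 keys starting from A0
    (key 1); each cut-off is the mean of the fundamentals of the two adjacent
    tones straddling a boundary.  The first three octaves (27.5 Hz up to about
    207.65 Hz) are merged into one band, leaving 6 bands in total.
    """
    # boundary between octave group j and j+1 sits between keys 12*j and 12*j + 1
    boundaries = [(piano_key_freq(12 * j) + piano_key_freq(12 * j + 1)) / 2.0
                  for j in range(1, 8)]
    # merging the first three octaves drops the first two internal boundaries,
    # leaving 5 internal cut-offs, i.e. 6 bands
    return boundaries[2:]
```

With these assumptions the lowest cut-off comes out near 213.8 Hz, the mean of G#3 (207.652 Hz, the top of the merged band quoted in the text) and A3 (220 Hz).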
Further, the working process of the note event detection step is as follows:
s101, inputting piano music in a wav format, dividing audio into 6 sub-bands after passing through an octave filter bank, carrying out normalization, framing and windowing, Q-switching (Variable Q Transform, VQT) and power scaling on each sub-band, obtaining a power scaling energy spectrum of each frame, calculating first-order differences of adjacent frames, and adding all lines of a specific frame to obtain a novel Function (Novelty Function) of the sub-band;
s102, calculating root mean square energy of each sub-band in each frame, summing to obtain energy of the sub-band, taking energy specific weight of each sub-band relative to the whole music band as a weight coefficient, and weighting and summing novel functions of each sub-band;
s103, windowing the weighted and summed novel function, summing each frame in a window, performing first-order difference on each frame, calculating the difference value between the weighted and summed novel function and the windowed first-order difference, and taking the result as a note starting point detection function;
s104, carrying out normalization processing on the starting point detection function and detecting a peak value of the starting point detection function, wherein the time corresponding to the peak value is the starting point time of a note, setting a time threshold value, and combining the starting points of notes with adjacent time differences smaller than the threshold value to serve as a note starting point;
s105, setting a threshold value according to the short-time energy of a first frame from a note starting point, judging frame by frame, if a frame with short-time energy smaller than the threshold value is found, regarding the frame as a note ending point, and if the short-time energy of all frames before a second note starting point is larger than the threshold value, regarding the second note starting point as the ending point of the first note;
s106, taking the starting point and the end point of each note as a note event segment, wherein the note event segment comprises one or more notes played at the same moment.
Further, the working process of the multi-pitch estimation step is as follows:
s201, constructing a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix by utilizing spectrum characteristics of a transient stage and a steady stage of a single note sample;
s202, according to the obtained note transient characteristic dictionary base matrix and note steady-state characteristic dictionary base matrix, normalizing frequency spectrum application based on transient and steady-state phases of a note event segment to be identified
Figure BDA0002671490110000041
The non-negative matrix factorization method of the norm sparsity constraint performs multi-note result estimation, i.e., detects the number of notes present in a note event segment and their pitch values.
Further, the process of step S201 is as follows:
s2011, respectively performing Q-switching conversion on transient and steady states of 88 piano single-symbol sample signals to obtain respective frequency spectrum matrixes, and performing normalization processing;
s2012, respectively carrying out non-negative matrix factorization estimation on the normalized frequency spectrum matrixes in the transient stage and the steady state stage to obtain spectrum base atoms of the note sample, thereby obtaining a frequency spectrum feature matrix of the note in the corresponding stage;
s2013, carrying out normalization processing on each spectrum base atom of each note feature matrix;
s2014, sequencing the notes according to the order of the pitches of 88 notes of the piano from small to large, and respectively splicing to obtain a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix in a note transient stage and a note steady state stage.
Further, the process of step S202 is as follows:
s2021, respectively performing Q-switching transformation on a transient stage and a steady-state stage of the note event segment obtained in the note event detection step to obtain respective frequency spectrum matrixes, and performing normalization processing;
s2022, carrying out normalized transient stage and steady stage frequency spectrum matrix by adopting note transient characteristic dictionary base matrix and note steady characteristic dictionary base matrix
Figure BDA0002671490110000051
Non-negative matrix factorization estimation of norm sparse constraint is carried out to obtain coefficient matrixes of corresponding stages, and the coefficient matrixes of the note event segments are obtained by adding;
s2023, performing threshold post-processing on the obtained coefficient matrix, and judging that a single note exists in the note event segment when the coefficient of the single note exceeds a set threshold value;
s2024, splicing the note sequences containing the multi-notes according to the time sequence, so as to obtain a multi-pitch estimation result.
Compared with the prior art, the invention has the following advantages and effects:
1) In the note event detection step, the occurrence of note events is detected and the onset and offset times of notes are extracted. Considering that some low-frequency note onsets are very soft, band splitting, energy weighting and power scaling are used to enhance weaker onsets relative to stronger ones and to strengthen the robustness of the detection method. In addition, the windowing and differencing processing increases the number of correctly detected note onsets and reduces false alarms, improving the overall detection performance.
2) In the multi-pitch estimation step, multi-pitch estimation is based on decomposing the note event segments, which takes into account the striking moment, pitch value and ending moment of notes during piano playing and avoids spurious-playing errors. Considering the continuity of music and the time-varying nature of notes, a note is divided into two main stages, transient and steady-state, and the transient and steady-state stages of note event segments are processed separately; using a multi-atom note feature dictionary and introducing a sparse-norm constraint into the non-negative matrix factorization achieves higher estimation accuracy while keeping the procedure fast.
Drawings
FIG. 1 is a flow chart of the multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary disclosed in the present invention;
FIG. 2 is a frequency response diagram of the octave filter bank in an embodiment of the invention;
FIG. 3 is a schematic diagram of the stages of a note in an embodiment of the invention;
FIG. 4 is a diagram of the note event division of a portion of audio in an embodiment of the present invention;
FIG. 5 is a block diagram of the note event detection step in an embodiment of the present invention;
FIG. 6 is a block diagram of the construction of the note feature dictionary base matrices in the multi-pitch estimation step in an embodiment of the present invention;
FIG. 7 is a block diagram of the spectral decomposition of a note event segment in the multi-pitch estimation step in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
FIG. 1 shows the flowchart of the multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary in this embodiment. The specific steps are as follows:
a note event detection step: input a piece of piano music in wav format, detect the occurrence of note events with an octave filter bank, extract the onset and offset times of the notes, and take the interval between the onset and offset of a note as a note event segment (a note event segment is formed between a note's onset and offset, as shown in FIG. 4), dividing the piano music into a number of note event segments, each containing one or more notes played at the same moment;
and a multi-pitch estimation step: extract the number of notes and the pitch values of each note event segment and concatenate the note sequence containing the multiple notes in time order.
FIG. 2 shows the frequency response of the octave filter bank in an embodiment of the invention. It comprises a plurality of band-pass filters, where the number of filters and the cut-off frequency of each filter are determined by the number of piano keys and twelve-tone equal temperament. A standard piano is a twelve-tone equal-temperament instrument with 88 keys, the pitch of each key being fixed by equal temperament. Considering that the fundamental frequencies of the first three octaves of the piano lie between 27.5 Hz and 207.652 Hz, and that a band-pass filter attenuates strongly at the boundary frequencies of these octaves, the first three octaves are merged into one band. The filter bank therefore contains 6 band-pass filters, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones that straddle an octave boundary.
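As a numpy-only illustration of the six-band split (the patent uses band-pass filters; here, as a simplification, each band is cut out in the frequency domain instead, which keeps the example self-contained):

```python
import numpy as np

def split_subbands_fft(x, cutoffs, fs=44100):
    """Illustrative frequency-domain split of mono signal x into 6 sub-bands.

    cutoffs : the 5 internal cut-off frequencies of the 6-band bank.
    Each band is obtained by zeroing rFFT bins outside its range; the upper
    edge is set to fs so that the mask `freqs < hi` also covers the Nyquist
    bin, making the bands sum back to the original signal.
    """
    edges = [0.0] + list(cutoffs) + [float(fs)]
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spec * mask, n=len(x)))
    return bands
```

Because the band masks are disjoint and cover every bin, the sub-band signals sum back to the input, which is a convenient sanity check on the split.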
FIG. 3 is a schematic diagram of the stages of a note in an embodiment of the present invention, represented by the amplitude change from the beginning to the end of the note. The evolution of a note is generally divided into transient and steady-state stages, and further into four basic stages: attack, decay, sustain and release. From the attack stage to the decay stage the amplitude changes rapidly with time; this is the transient stage. From the sustain stage to the release stage the amplitude changes smoothly with time; this is the steady-state stage. The starting point of the transient stage in the figure is the onset of the note, and the object of the note event detection step is to detect this onset.
FIG. 5 is a block diagram of the note event detection step of the present invention, which specifically comprises the following steps:
s101, inputting piano music in a wav format, and dividing the audio into 6 sub-bands after passing through an octave filter bank;
s102, carrying out pretreatment such as normalization, framing and windowing on each sub-band, wherein the parameter setting considers that the sampling rate of the music in wav format is generally 44.1kHz, and a Hamming window is used for signals in order to avoid spectrum leakage, so that 2048 sampling points are taken for the window length, 512 sampling points are taken for the frame shift, the time difference of adjacent frames is about 11.6 milliseconds, namely the error time of a predicted value and an actual result is at most 11.6 milliseconds;
s103, performing Q-switching transformation on the preprocessed signals to obtain corresponding frequency spectrums X (K, N), and performing normalization processing, wherein K represents the frequency number of each frame, and N represents the frame number. In the conventional time-frequency transformation, the short-time Fourier transformation adopts a fixed length window, so that the problem of frequency resolution is easy to generate, meanwhile, the frequency points of the short-time Fourier transformation are linearly distributed and cannot be in one-to-one correspondence with the frequencies of piano notes in exponential distribution, and the estimation error of certain note frequencies is larger; the frequency bin distribution of the constant Q transform is not linear but exponential, but the time resolution is relatively low due to the large number of sampling points chosen at low frequencies. Therefore, a variable Q transform is employed that also has a higher temporal resolution at low frequencies, whose k-th frequency component has a bandwidth as follows:
Figure BDA0002671490110000081
δ k =α·f k +gamma formula (2)
Wherein f k Is the center frequency of frequency band k; delta k Is the bandwidth of the kth frequency component; f (f) min Analyzing the lowest frequency in the frequency interval of the music signal, and taking 27.5Hz; b is the number of frequency points in each octave, which is usually a multiple of 12, and 48 is taken in order to obtain richer spectral feature information; k is the number of frequency bands divided in the variable Q transform spectrum;
Figure BDA0002671490110000082
is a constant value and is only related to the value of b; gamma is the compensation parameter, when gamma>0, increasing the time resolution at low frequencies;
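The center-frequency/bandwidth relations (1)–(2) can be checked numerically; the function below assumes the reconstructed forms given above, with α = 2^(1/b) − 2^(−1/b):

```python
import numpy as np

def vqt_bins(f_min=27.5, b=48, K=480, gamma=0.0):
    """Center frequencies f_k and bandwidths delta_k of the variable-Q transform,
    per formulas (1)-(2) as reconstructed above."""
    k = np.arange(1, K + 1)
    f = f_min * 2.0 ** ((k - 1) / b)            # formula (1): geometric bin spacing
    alpha = 2.0 ** (1.0 / b) - 2.0 ** (-1.0 / b)
    bw = alpha * f + gamma                      # formula (2): bandwidth with offset gamma
    return f, bw
```

With γ = 0 the ratio f_k / δ_k is the same for every bin, i.e. the transform reduces to a constant-Q transform; γ > 0 widens the low-frequency bins, which is exactly the claimed time-resolution gain at low frequencies.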
s104, carrying out power scaling on each normalized frame spectrum, calculating the first-order difference of adjacent frames, adding all rows of a specific frame to obtain a novel function of the sub-band, and adopting the following formula:
SF(k,n)=|X(k,n+1)| p -|X(k,n)| p formula (3)
Figure BDA0002671490110000091
Wherein SF (k, n) is expressed as the spectral flux of adjacent power scaling energy spectra; x (k, n) represents a variable Q transform spectrum of the input signal; NF (NF) i (n) a novel function represented as the subband; |x| p Represents the power of p of the element in x; h (x) = (x+|x|)/2 is represented as a half-wave rectification function; k represents the frequency number of each frame; p represents a power scaling factor, typically taken as 0.5;
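Formulas (3)–(4) can be sketched directly in numpy:

```python
import numpy as np

def subband_novelty(X, p=0.5):
    """Power-scaled spectral-flux novelty function of one sub-band, formulas (3)-(4).

    X : magnitude VQT spectrogram of the sub-band, shape (K, N).
    Returns NF_i(n) for the N-1 frame transitions.
    """
    Xp = np.abs(X) ** p                # power scaling, p = 0.5
    SF = Xp[:, 1:] - Xp[:, :-1]        # first-order difference of adjacent frames (3)
    H = (SF + np.abs(SF)) / 2.0        # half-wave rectification keeps only energy increases
    return H.sum(axis=0)               # sum over the K frequencies of each frame (4)
```

The half-wave rectification means only energy rises contribute, so a sudden attack produces a peak while the following decay contributes nothing.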
s105, calculating root mean square energy of each frame of each sub-band and summing to obtain energy of the sub-band, taking energy specific weight of each sub-band relative to the whole music band as a weight coefficient, weighting and summing novel functions of each sub-band, and adopting the following formula:
Figure BDA0002671490110000092
Figure BDA0002671490110000093
wherein omega i Is the energy specific gravity of the ith sub-band relative to the whole music band, i.e. the weight coefficient; NF (NF) i (n) is a novel function of the ith subband; RMSE i Is the root mean square energy of the ith sub-band; NF (n) is a novel function after weighted summation;
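A minimal sketch of formulas (5)–(6); for simplicity the sub-band energy is computed here as the RMS of the whole sub-band signal rather than as a sum of per-frame RMS values, which is an assumed simplification:

```python
import numpy as np

def weighted_novelty(subband_signals, novelty_fns):
    """Energy-weighted sum of sub-band novelty functions, formulas (5)-(6).

    subband_signals : list of the 6 sub-band time signals.
    novelty_fns     : list of the 6 matching novelty functions, equal length N.
    """
    rmse = np.array([np.sqrt(np.mean(s ** 2)) for s in subband_signals])
    w = rmse / rmse.sum()              # formula (5): energy share of each sub-band
    NF = np.stack(novelty_fns)         # shape (6, N)
    return w @ NF                      # formula (6): weighted sum over sub-bands
```

A silent sub-band thus receives weight 0 and cannot inject noise into the combined detection function.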
s106, windowing the weighted and summed novel function NF (n), carrying out first-order difference on each frame in the summing window, then calculating the difference value between the weighted and summed novel function and the windowed first-order difference, taking the result as a note starting point detection function, and adopting the following formula:
Figure BDA0002671490110000094
ODF (n) =nf (n) - { w (n) -w (n-1) } formula (8)
Wherein W (n) is the sum of the length W frame windows after frame n for the novel function; ODF (n) is a note origin detection function;
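Formulas (7)–(8) as a sketch, assuming W(n) sums the w frames after frame n (the text only says "the sum of the length-w frame window after frame n", so the exact window placement is an assumption):

```python
import numpy as np

def onset_detection_function(NF, w=3):
    """Note onset detection function, formulas (7)-(8).

    NF : weighted novelty function, 1-D array.
    w  : window length in frames.
    """
    N = len(NF)
    # formula (7): running sum over the w frames after frame n (slices clip at the end)
    W = np.array([NF[n + 1:n + 1 + w].sum() for n in range(N)])
    ODF = NF.copy()
    # formula (8): subtract the first-order difference of the windowed sums
    ODF[1:] = NF[1:] - (W[1:] - W[:-1])
    return ODF
```

The subtracted difference acts as a local trend term, so a frame is emphasised only when it stands out against the novelty in its neighbourhood.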
s107, carrying out normalization processing on the starting point detection function and detecting a peak value of the starting point detection function, wherein the time corresponding to the peak value is the starting point time of a note, setting a time threshold value, and combining the starting points of notes with adjacent time differences smaller than the threshold value to serve as a note starting point, wherein the time threshold value is usually 50 milliseconds;
s108, setting a threshold according to the short-time energy of a first frame from a note starting point, judging from frame to frame, if a frame with short-time energy smaller than the threshold is found, then regarding the frame as a note end point, and if the short-time energy of all frames before a second note starting point is larger than the threshold, regarding the second note starting point as the end point of the first note, wherein the threshold is usually one tenth of the short-time energy of the first frame;
s109, taking the starting point and the end point of each note as a note event segment, wherein the note event segment comprises one or more notes played at the same moment.
FIG. 6 is a block diagram of the construction of the note feature dictionary base matrices in the multi-pitch estimation step in an embodiment of the present invention; the specific process is as follows:
s2011, respectively performing Q-switching conversion on transient phases and steady phases of 88 piano single symbol sample signals X (n) to obtain respective frequency spectrum matrixes X m×n And performing normalization processing, wherein m=1, …, M represents a frequency number, n=1, …, N represents a frame number;
s2012, respectively carrying out frequency spectrum matrix X on the transient stage and the steady-state stage after normalization m×n And carrying out nonnegative matrix factorization to obtain spectrum base atoms of the corresponding stage of the note sample. Assume that the spectrum matrix of the corresponding stage of the kth note sample is expressed as
Figure BDA0002671490110000101
For matrix->
Figure BDA0002671490110000102
Performing non-negative matrix factorization, and taking the rank as r:
Figure BDA0002671490110000103
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0002671490110000104
a spectral feature matrix representing a corresponding stage of the note sample; />
Figure BDA0002671490110000105
A coefficient matrix corresponding to the spectral feature matrix representing the corresponding stage of the note sample; rank r represents the number of atoms in constructing a dictionary. The decomposition operation adopts a cost function based on beta divergence, as shown in a formula (10); to get->
Figure BDA0002671490110000106
Firstly, random non-negative initialization is adopted, and then, the matrix is iteratively updated by using a formula (11) and a formula (12)>
Figure BDA0002671490110000107
And->
Figure BDA0002671490110000108
Each iteration causes a cost function C β The value of (X|Z) decreases until it converges to stop the iteration;
Figure BDA0002671490110000111
Figure BDA0002671490110000112
Figure BDA0002671490110000113
wherein z=wh; operator ° represents the product of two matrix corresponding elements; operator ≡ represents assigning left object with its right object; h T Representing a transpose of matrix H; z is Z a Representing the a-th power of the elements in the matrix Z; beta takes a value of 0.5;
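A minimal sketch of β-divergence NMF with the multiplicative updates of formulas (11) and (12). The function name `nmf_beta` is illustrative, a fixed iteration count stands in for the convergence test, and numpy is assumed.

```python
import numpy as np

def nmf_beta(X, r, beta=0.5, n_iter=300, seed=0):
    """Minimal beta-divergence NMF sketch: X ~ W @ H, with the
    multiplicative updates of formulas (11) and (12), which keep
    W and H non-negative throughout."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r)) + 1e-6      # random non-negative init
    H = rng.random((r, N)) + 1e-6
    eps = 1e-12
    for _ in range(n_iter):
        Z = W @ H + eps
        H *= (W.T @ (Z ** (beta - 2) * X)) / (W.T @ Z ** (beta - 1) + eps)
        Z = W @ H + eps
        W *= ((Z ** (beta - 2) * X) @ H.T) / (Z ** (beta - 1) @ H.T + eps)
    return W, H
```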
s2013, for each note, each spectral basis atom w_i (the i-th column of the spectral feature matrix W^(k) of the corresponding phase) is normalized by its l2 norm, as follows:

w_i ← w_i / ‖w_i‖_2,   i = 1, …, r
s2014, sorting the 88 notes of the piano by pitch in ascending order and concatenating their feature matrices to obtain the note transient feature dictionary basis matrix W_t for the note transient phase and the note steady-state feature dictionary basis matrix W_s for the steady-state phase, as follows:

W_t = [W_t^(1), W_t^(2), …, W_t^(88)]

W_s = [W_s^(1), W_s^(2), …, W_s^(88)]

where W_t^(k) is the note transient spectral feature matrix of the transient phase of the k-th note sample, and W_s^(k) is the note steady-state spectral feature matrix of the steady-state phase of the k-th note sample.
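Steps S2013-S2014 (atom normalization and dictionary concatenation) can be sketched as below. The name `build_dictionary` and its fallback behaviour are hypothetical; the `nmf` argument is a placeholder for the β-NMF of step S2012, and numpy is assumed.

```python
import numpy as np

def build_dictionary(note_spectra, r=1, nmf=None):
    """Sketch of S2013-S2014: factor each note's spectrum matrix,
    l2-normalize every spectral atom, and concatenate the per-note
    feature matrices (assumed pitch-ascending) into one dictionary
    basis W of shape M x (len(note_spectra) * r).

    note_spectra : list of non-negative (M x N_k) spectrum matrices
    nmf          : routine returning (W_k, H_k); placeholder for the
                   beta-NMF of S2012
    """
    atoms = []
    for X in note_spectra:
        if nmf is not None:
            W_k, _ = nmf(X, r)
        else:                       # fallback: mean spectrum as a single atom
            W_k = X.mean(axis=1, keepdims=True)
        # l2-normalize each column (spectral basis atom)
        W_k = W_k / (np.linalg.norm(W_k, axis=0, keepdims=True) + 1e-12)
        atoms.append(W_k)
    return np.hstack(atoms)
```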
Fig. 7 is a block diagram of the note event segment spectrum decomposition in the multi-pitch estimation step according to an embodiment of the present invention; the specific process is as follows:
s2021, applying the constant-Q transform to the transient phase and the steady-state phase of the note event segment obtained in the note event detection step to obtain the spectrum matrix X_t of the transient phase and the spectrum matrix X_s of the steady-state phase, and normalizing them, where m = 1, …, M indexes the frequency bins and n = 1, …, N indexes the frames;
s2022, decomposing the normalized transient-phase spectrum matrix X_t and steady-state-phase spectrum matrix X_s against the prepared note transient feature dictionary basis matrix W_t and note steady-state feature dictionary basis matrix W_s, respectively, by non-negative matrix factorization with an l_{p,q}-norm sparsity constraint, to estimate the coefficient matrix H_t of the note transient phase and the coefficient matrix H_s of the note steady-state phase, which are added to obtain the coefficient matrix H of the note event segment. The formulas are as follows:

H_t = argmin_H C_β(X_t | W_t H) + λ_1 Ψ_{p,q}(H)

H_s = argmin_H C_β(X_s | W_s H) + λ_2 Ψ_{p,q}(H)

H = H_t + H_s

with the dictionaries held fixed, the coefficient matrices are updated iteratively:

H_t ← H_t ∘ [W_t^T ((W_t H_t)^{β−2} ∘ X_t)] / [W_t^T (W_t H_t)^{β−1} + λ_1 ∇Ψ_{p,q}(H_t)]

H_s ← H_s ∘ [W_s^T ((W_s H_s)^{β−2} ∘ X_s)] / [W_s^T (W_s H_s)^{β−1} + λ_2 ∇Ψ_{p,q}(H_s)]

where the operator ∘ denotes the element-wise product of two matrices; the operator ← assigns the object on its right to the object on its left; argmin_x f(x) denotes the value of the variable x that minimizes the objective function f; H_t is the coefficient matrix of the transient phase; H_s is the coefficient matrix of the steady-state phase; W_t is the note transient feature dictionary basis matrix; W_s is the note steady-state feature dictionary basis matrix; λ_1 and λ_2 control the strength of the sparsity constraint; Ψ_{p,q}(H_t) and Ψ_{p,q}(H_s) are the added l_{p,q}-norm constraint terms, with parameters p = 2 and q = 0.5; ∇Ψ_{p,q}(H_t) and ∇Ψ_{p,q}(H_s) are the gradients of the l_{p,q}-norm constraint terms with respect to the corresponding coefficient matrices; β is set to 0.5;
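A sketch of the fixed-dictionary sparse decomposition of step S2022. For simplicity it substitutes a plain l1 penalty, whose gradient is the constant λ, for the patent's l_{p,q} mixed-norm term (p = 2, q = 0.5); the name `sparse_decompose` is illustrative and numpy is assumed.

```python
import numpy as np

def sparse_decompose(X, W, beta=0.5, lam=0.1, n_iter=100, seed=0):
    """Non-negative decomposition of a note-event spectrum X against a
    FIXED dictionary W, with a sparsity penalty on the activations H.
    Only H is updated; W stays fixed.  An l1 penalty (constant
    gradient lam in the denominator) stands in for the l_{p,q} term."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + 1e-6
    eps = 1e-12
    for _ in range(n_iter):
        Z = W @ H + eps
        H *= (W.T @ (Z ** (beta - 2) * X)) / (W.T @ Z ** (beta - 1) + lam + eps)
    return H
```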
s2023, sorting the entries of the obtained coefficient matrix H in descending order, taking the first M coefficients as the coefficients of the candidate fundamental frequencies, and then computing the threshold λ·ΔS: a single note is judged to be present in the note event segment when its coefficient exceeds the set threshold, where λ is a constant in the range 0 < λ < 1, and ΔS is the difference between the largest coefficient and the M-th largest coefficient, i.e., ΔS = S(1) − S(M);
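The threshold post-processing of step S2023 can be sketched as follows; the function name `pick_notes` and the example values are illustrative, and numpy is assumed.

```python
import numpy as np

def pick_notes(coeffs, M=5, lam=0.5):
    """Sketch of S2023: sort the per-note activation strengths in
    descending order, keep the top M as candidate fundamentals, and
    accept every candidate whose coefficient exceeds
    lam * (S(1) - S(M))."""
    s = np.sort(coeffs)[::-1]        # coefficients, descending
    delta = s[0] - s[M - 1]          # delta_S = S(1) - S(M)
    thresh = lam * delta
    cand = np.argsort(coeffs)[::-1][:M]
    return sorted(int(k) for k in cand if coeffs[k] > thresh)
```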
s2024, splicing the multi-note results of the note event segments in chronological order to obtain the multi-pitch estimation result.
The present invention can be implemented as described above and achieves the aforementioned technical advantages.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. A multi-pitch estimation method based on a note transient dictionary and a steady state dictionary is characterized by comprising the following steps:
a note event detection step of inputting a piece of piano music in wav format, detecting the occurrence of note events with an octave filter bank, extracting the onset time and end time of each note, taking the signal between a note's onset and end point as a note event segment, and thereby dividing the piano music in wav format into a number of note event segments, each note event segment containing one or more notes played at the same moment;
a multi-pitch estimation step of extracting the number of notes and the pitch value of each note in each note event segment, and splicing the results in chronological order to obtain a note sequence containing multiple notes, the working process being as follows:
s201, constructing a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix by utilizing spectrum characteristics of a transient stage and a steady stage of a single note sample;
s202, according to the obtained note transient feature dictionary basis matrix and note steady-state feature dictionary basis matrix, applying non-negative matrix factorization with an l_{p,q}-norm sparsity constraint to the normalized spectra of the transient and steady-state phases of the note event segment to be identified, and estimating the multi-note result, i.e., the number of notes present in the note event segment and their pitch values.
2. The multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary according to claim 1, wherein the octave filter bank comprises a plurality of band-pass filters, the number of band-pass filters and the cut-off frequency of each band-pass filter being determined by the number of piano keys and twelve-tone equal temperament.
3. The multi-pitch estimation method based on a note transient dictionary and a steady-state dictionary according to claim 1, wherein the number of band-pass filters in the octave filter bank is 6, and each cut-off frequency is the mean of the fundamental frequencies of the two adjacent tones at the corresponding octave boundary.
4. The multi-pitch estimation method based on a transient dictionary and a steady state dictionary of notes according to claim 2, wherein the note event detection step works as follows:
s101, inputting piano music in wav format, dividing the audio into 6 sub-bands with the octave filter bank, and for each sub-band performing normalization, framing and windowing, the constant-Q transform and power scaling to obtain a power-scaled energy spectrum for each frame; computing the first-order difference between adjacent frames and summing over all frequency bins of each frame to obtain the novelty function of the sub-band;
s102, computing the root-mean-square energy of each sub-band in each frame and summing over frames to obtain the root-mean-square energy of the sub-band; taking the energy proportion of each sub-band relative to the whole piece as its weight coefficient, and forming the weighted sum of the sub-band novelty functions;
s103, windowing the weighted novelty function, summing the frames within each window and taking the first-order difference, then computing the difference between the weighted novelty function and this windowed first-order difference; the result is taken as the note onset detection function;
s104, normalizing the onset detection function and detecting its peaks, where the time corresponding to each peak is the onset time of a note; setting a time threshold and merging note onsets whose adjacent time difference is smaller than the threshold into a single note onset;
s105, setting a threshold according to the short-time energy of the first frame after a note onset and judging frame by frame: if a frame whose short-time energy is smaller than the threshold is found, that frame is taken as the note end point; if the short-time energy of every frame before the next note onset is larger than the threshold, the next note onset is taken as the end point of the current note;
s106, taking the signal between the onset and end point of each note as a note event segment, where a note event segment contains one or more notes played at the same moment.
5. The multi-pitch estimation method based on the transient dictionary and the steady state dictionary of notes according to claim 1, wherein the process of step S201 is as follows:
s2011, applying the constant-Q transform to the transient phase and the steady-state phase of each of the 88 piano single-note sample signals to obtain the respective spectrum matrices, and normalizing them;
s2012, respectively carrying out nonnegative matrix factorization on the normalized frequency spectrum matrixes in the transient stage and the steady state stage to obtain spectrum base atoms of the note sample, thereby obtaining a frequency spectrum characteristic matrix in the corresponding stage of the note;
s2013, carrying out normalization processing on each spectrum base atom of each note spectrum characteristic matrix;
s2014, sequencing the notes according to the order of the pitches of 88 notes of the piano from small to large, and respectively splicing to obtain a note transient characteristic dictionary base matrix and a note steady characteristic dictionary base matrix in a note transient stage and a note steady state stage.
6. The multi-pitch estimation method according to claim 1, wherein the procedure of step S202 is as follows:
s2021, applying the constant-Q transform to the transient phase and the steady-state phase of the note event segment obtained in the note event detection step to obtain the respective spectrum matrices, and normalizing them;
s2022, decomposing the normalized transient-phase and steady-state-phase spectrum matrices against the note transient feature dictionary basis matrix and the note steady-state feature dictionary basis matrix, respectively, by non-negative matrix factorization with an l_{p,q}-norm sparsity constraint, to obtain the coefficient matrices of the corresponding phases, which are added to obtain the coefficient matrix of the note event segment;
s2023, performing threshold post-processing on the obtained coefficient matrix, and judging that a single note exists in the note event segment when the coefficient of the single note exceeds a set threshold value;
s2024, splicing the note sequences containing the multi-notes according to the time sequence, so as to obtain a multi-pitch estimation result.
CN202010934594.9A 2020-09-08 2020-09-08 Multi-pitch estimation method based on note transient dictionary and steady state dictionary Active CN112259063B (en)

Publications (2)

Publication Number  Publication Date
CN112259063A (en)   2021-01-22
CN112259063B (en)   2023-06-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant