CN112735365A - Probability model-based automatic extraction algorithm for main melody - Google Patents

Probability model-based automatic extraction algorithm for main melody

Info

Publication number
CN112735365A
Authority
CN
China
Prior art keywords
fundamental frequency
calculating
probability
value
window
Prior art date
Legal status
Pending
Application number
CN202011545338.7A
Other languages
Chinese (zh)
Inventor
米岚 (Mi Lan)
何晓娇 (He Xiaojiao)
Current Assignee
Chongqing Yue Party Information Technology Co ltd
Original Assignee
Chongqing Yue Party Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Yue Party Information Technology Co ltd filed Critical Chongqing Yue Party Information Technology Co ltd
Priority to CN202011545338.7A
Publication of CN112735365A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/36 Accompaniment arrangements
    • G10H 1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10H 2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/005 Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
    • G10H 2250/015 Markov chains, e.g. hidden Markov models [HMM], for musical processing, e.g. musical analysis or musical composition

Abstract

The invention relates to the field of music information retrieval, in particular to a probabilistic model-based automatic extraction algorithm for the main melody, which comprises the following steps: S101, calculating a short-time average amplitude difference function; S102, calculating a cumulative average normalized amplitude difference function; S103, extracting a plurality of candidate fundamental frequencies and their probabilities; S104, calculating the main fundamental frequency; and S105, calculating the main melody notes. The invention can automatically generate the main melody from the vocal file of a song. Because the cumulative average normalized amplitude difference function and a Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved and the octave deviation problem is significantly reduced; by introducing an HMM, the start and stop times of the main melody are more accurate and the track curve is smoother.

Description

Probability model-based automatic extraction algorithm for main melody
Technical Field
The invention relates to the field of music information retrieval, in particular to a probabilistic model-based automatic extraction algorithm for a main melody.
Background
The main melody refers to the fundamental frequency of the human voice. Extracting the main melody means estimating, from the standard vocal audio of a song, the pitch or fundamental frequency of the monophonic note sequence corresponding to the main melody, and then transcribing it into a text file, a MIDI binary file or the like. As an important issue in the field of music information retrieval, automatic extraction of the main melody plays an important role in music information interaction, such as music recognition, intonation analysis and humming retrieval. A typical application of pitch analysis is the automatic pitch scoring system of mobile-phone karaoke software.
Most existing melody extraction methods estimate pitch salience from the result of a time-frequency transformation and then extract melody lines by applying tracking rules. This type of approach has many problems, such as octave error, in which the pitch F0 is estimated as an integer or fractional multiple of the true value; in addition, some tracking rules are not smooth enough, causing note pitch jumps that lead to pitch errors.
Disclosure of Invention
In order to solve the above problems, the main object of the present invention is to provide an automatic extraction algorithm for a main melody based on a probability model, which can improve the accuracy of identification and tracking of the main melody of a vocal audio file, and specifically comprises: the accuracy of the pitch value is improved, and the octave deviation problem is reduced; the start-stop time of the main melody is more accurate, and the track curve is smoother.
In order to achieve the purpose, the invention adopts the technical scheme that:
the automatic extraction algorithm of the main melody based on the probability model is characterized by comprising the following steps:
s101, calculating a short-time average amplitude difference function;
s102, calculating an accumulated average normalized amplitude difference function;
s103, extracting a plurality of candidate fundamental frequencies and probabilities;
s104, calculating a main fundamental frequency;
and S105, calculating the main melody notes.
Further, the specific process of step S101 is: dividing audio time into frames, and calculating a short-time average amplitude difference function of each frame of time domain signal;
recording of an input digital audio signal xtHas a sub-frame sequence of xiI is 1, …, 2W; subscript i denotes the ith sample point; the typical window length of the frame is 2048 as 2W, and the window length is 1/4; the calculation formula of the short-time average amplitude difference function is as follows (1):
Figure BDA0002855517050000021
where t represents the t-th speech window after framing; τ is the number of possible cycles, and ranges from an integer between 0 and W.
Further, the specific process of step S102 is: performing cumulative average normalization on the result of step S101, and traversing all τ to obtain the CMNDF vector of the t-th speech window according to formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
The first value of the CMNDF vector is 0, and the vector has length W.
Further, the specific process of step S103 is: if a minimum point in the cumulative average normalized amplitude difference function of a speech window is also smaller than a certain threshold value, the minimum point can be regarded as a fundamental frequency or harmonic frequency;
and establishing a Beta probability distribution model of a threshold, and calculating a cumulative distribution function of the candidate fundamental frequencies to calculate a probability value so as to obtain a candidate fundamental frequency probability pair set of each voice window.
Further, the specific process of step S104 is: dividing the fundamental frequency range of human voice into a plurality of discrete parts (called bins for short), wherein any fundamental frequency value corresponds to a certain fundamental frequency bin region; taking the serial number of bin as a hidden state to track the fundamental frequency;
the fundamental frequency tracking process is modeled as a Hidden Markov process (HMM); establishing a state transition probability matrix between fundamental frequency bins, and calculating an observation probability vector of each voice window through the candidate fundamental frequency probability pair set; and decoding the HMM, and calculating the main fundamental frequency of the whole audio file through a Viterbi algorithm.
Further, the specific process of step S105 is: the step is mainly to smooth the fundamental frequency calculated in the step S104 and establish the HMM process of note pitch value; dividing a pitch range into a plurality of bins, wherein each pitch bin can have three states of starting, stabilizing and muting, and the states are converted according to a Gaussian distribution probability model;
calculating an observation probability vector of the current speech window according to the base frequency value of the known speech window, and obtaining a smoothed pitch value of each speech window according to a Viterbi decoding algorithm of an HMM (hidden Markov model); several time-continuous speech windows of equal pitch are connected to form a complete note, and all speech windows are traversed, so as to obtain the main melody track of the whole audio file.
The invention has the beneficial effects that:
the invention can automatically generate the main melody from the vocal file of the song, and because the cumulative average normalized amplitude difference function and the Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved, and the problem of octave deviation is obviously reduced; by introducing the HMM model, the start-stop time of the main melody is more accurate, and the track curve is smoother.
The invention can automatically generate the main melody from the vocal file of a song; in typical applications such as the automatic scoring system of mobile-phone karaoke software, the scoring accuracy of user works is improved. Meanwhile, the cost of manual production is saved, which brings great economic benefit.
Drawings
FIG. 1 is a flowchart illustrating the automatic extraction of the main melody according to the present invention.
Fig. 2 is a schematic diagram of a probability density function image of the threshold S in step S103 of this embodiment.
Detailed Description
The invention relates to a probabilistic model-based automatic extraction algorithm for a main melody (figure 1 is a flow chart for automatically extracting the main melody), which comprises the following steps:
Step S101: the short-time average amplitude difference function (abbreviated AMDF, the same below) is calculated within a window. For a periodic speech signal, the average amplitude difference function is small at positions that are integer multiples of the period. Therefore, the audio is divided into frames along time, the short-time average amplitude difference function of each frame of the time-domain signal is calculated, and possible fundamental frequency positions can then be found from its variation pattern.
Let the framed sequence of the input digital audio signal x_t be x_i, i = 1, …, 2W, where subscript i denotes the i-th sample point; a typical frame window length is 2W = 2048, and the step size is 1/4 of the window length. The short-time average amplitude difference function is calculated according to the following formula (1):
d_t(τ) = Σ_{i=1}^{W} | x_i − x_{i+τ} |   (1)
where t denotes the t-th speech window after framing, and τ is the candidate period, taking integer values between 0 and W.
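By way of a non-limiting illustration, formula (1) can be sketched in Python (NumPy) as follows; the function name amdf, the single-window calling convention and the use of absolute differences as written in formula (1) above are assumptions of this sketch rather than a prescribed implementation:

    import numpy as np

    def amdf(frame, W):
        # Short-time average amplitude difference function d_t(tau), tau = 0..W-1,
        # for one framed window of 2*W samples (cf. formula (1)).
        d = np.empty(W)
        for tau in range(W):
            d[tau] = np.abs(frame[:W] - frame[tau:tau + W]).sum()
        return d

In practice the audio would first be cut into overlapping windows of 2W samples with a step of one quarter of the window length, and amdf would be applied to each window.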
Step S102: the cumulative average normalized amplitude difference function (abbreviated CMNDF, the same below) is calculated. The minimum positions of the short-time average amplitude difference function calculated in step S101 may correspond to the fundamental frequency and its harmonics. In order to find the minimum value more smoothly and to avoid mistaking 0 or an integer multiple of the fundamental frequency for the fundamental frequency, the cumulative average normalization of the short-time average amplitude difference sequence is first calculated and used as the input of the next step. That is, in order to analyze the fundamental frequency candidates more accurately, the result of step S101 is cumulative-average normalized, and the CMNDF vector of the t-th speech window is obtained by traversing all τ according to the following formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
the first value of the CMNDF vector is 0 and has a length W.
Step S103: a plurality of candidate fundamental frequencies and their probabilities are extracted. According to the characteristics of periodic signals, a minimum point of the cumulative average normalized amplitude difference function of a speech window can be regarded as a fundamental frequency or harmonic frequency if it is also smaller than a certain threshold value. A Beta probability distribution model of the threshold is established, and a cumulative distribution function is evaluated for each candidate fundamental frequency to calculate a probability value, thereby obtaining the set of candidate fundamental frequency probability pairs of each speech window. That is, according to the characteristics of the periodic signal, if a minimum point of the CMNDF vector calculated in step S102 is smaller than a threshold value S, it can be regarded as a fundamental frequency or a harmonic frequency.
(a) The empirical value of the threshold S is 0.1. S is treated as a random variable following a Beta distribution; typical parameters are α = 2 and β = 18, i.e. S ~ Beta(2, 18), and the mean of this Beta distribution is 0.1. The range 0 to 1 is divided into N = 100 equal parts, the random variable takes the values s_i, i = 1, 2, …, 100, the probability density function P(s_i) is calculated, and the probability density function image of S is plotted (see FIG. 2).
(b) Find the set of minimum points {τ_0} of the CMNDF vector y_t(τ);
(c) For each minimum point τ_0, the probability that the fundamental period T equals τ_0 is calculated as
p(T = τ_0) = Σ_{i=1}^{N} P(s_i) · [ y_t(τ_0) < s_i ]   (3)
Here [·] denotes the Iverson bracket: the expression equals 1 if the condition inside the bracket is satisfied and 0 otherwise; P(s_i) is the probability density function of the threshold S.
(d) Traverse the set of minimum points {τ_0}, calculate and normalize the probability values, and obtain the set of fundamental-frequency-period probability pairs of the t-th speech window:
z_τ(t) = { (τ_1, p_1), (τ_2, p_2), …, (τ_m, p_m) }   (4)
(e) According to the reciprocal relationship between frequency and period, the fundamental frequency is calculated as f = fs/τ, where fs is the sampling rate of the input audio, with a typical value of 44100 Hz; the set (4) is thus converted into the set of fundamental frequency probability pairs of the t-th speech window, as shown in formula (5):
Z(t) = { (f_1, p_1), (f_2, p_2), …, (f_m, p_m) }   (5)
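As a non-limiting sketch of steps (a) to (e), the candidate fundamental frequencies and probabilities of one speech window can be computed as follows; the function name, the half-sample threshold grid and the discretization of the Beta prior are assumptions of this illustration:

    import numpy as np
    from scipy.stats import beta

    def candidate_f0s(y, fs=44100, n=100, a=2.0, b=18.0):
        # y: CMNDF vector of one window; returns [(f_1, p_1), ..., (f_m, p_m)].
        s = (np.arange(n) + 0.5) / n                    # (a) N = 100 threshold values on (0, 1)
        p_s = beta.pdf(s, a, b) / n                     # discretized Beta(2, 18) prior P(s_i)
        minima = [t for t in range(1, len(y) - 1)       # (b) minimum points of y_t(tau)
                  if y[t] < y[t - 1] and y[t] <= y[t + 1]]
        pairs = []
        for tau in minima:
            p = p_s[s > y[tau]].sum()                   # (c) sum of P(s_i)[y_t(tau) < s_i]
            if p > 0.0:
                pairs.append((fs / tau, p))             # (e) f = fs / tau
        total = sum(p for _, p in pairs)                # (d) normalize
        return [(f, p / total) for f, p in pairs] if total > 0.0 else []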
Step S104: the dominant fundamental frequency is calculated. The fundamental frequency range of the human voice is divided into a plurality of discrete parts (called bins for short), so that any fundamental frequency value corresponds to a certain fundamental frequency bin; taking the sequence number of the bin as a hidden state, the fundamental frequency tracking process is modeled as a hidden Markov process (HMM). A state transition probability matrix between fundamental frequency bins is established, and the observation probability vector of each speech window is calculated from the candidate fundamental frequency probability pair set of step S103; solving the fundamental frequency value of each speech window is thus converted into the decoding process of the HMM, and the main fundamental frequency of the whole audio file can be calculated by the Viterbi algorithm. The method specifically comprises the following steps:
this step implements pitch tracking by calculating the fundamental frequency probability pairs of all the frame-divided speech windows of the audio in the previous step S103. Firstly, limiting the fundamental frequency range of human voice, wherein the typical value is 55Hz to 880Hz, and the pitch range of the human voice corresponds to A1 to A5 and the pitch value is 33 to 80 of a piano keyboard, and the total number of the human voice is 48 semitones; to improve the accuracy of the calculation, a semitone range is divided equally into a number of parts, typically 5 (e.g. A1 to # A1 divided into A1.1,A1.2,A1.3,A1.3,A1.5) The specific procedure of the algorithm is described below in terms of 5 equal divisions. The whole fundamental frequency corresponds to 240 pitch statistics stacks (abbreviated as bin) with the total of M5 × 48 for the final pitch, the fundamental frequency variation process is modeled as an HMM process, denoted as HMMpitch: for the tth speech window, the pitch variable ht240 × 2 ═ 480 states of the following formula (6):
h_t ~ { A1.1, A1.2, …, #G5.5, −A1.1, −A1.2, …, −#G5.5 }   (6)
Pitches with a negative sign indicate the case where the corresponding positive pitch is unvoiced; e.g. −A1.1 corresponds to A1.1. The value range of the observed variable O_t is the same as that of h_t, as in formula (7):
O_t ~ { A1.1, A1.2, …, #G5.5, −A1.1, −A1.2, …, −#G5.5 }   (7)
modeling the state transition probability vector of each hidden state into triangular probability distribution (triangular distribution) with the current state as the maximum value, wherein the semitone number deviation of two adjacent voice windows does not exceed a certain distance; typical values for the spacing on both sides are 2.5 semitones, corresponding to 5 bins.
In the set of fundamental frequency probability pairs calculated in step S103, for a given fundamental frequency probability pair (f_j, p_j), the nearest bin is found and numbered m, m ∈ {1, 2, …, M}; this bin is assigned the probability
q_m(t) = p_j,
while the other bins are assigned
q_i(t) = 0, i ≠ m.
Voicing and non-voicing of each speech window are treated as equally probable events, and the observation probability corresponding to the i-th bin of the t-th speech window is then
P(O_t | h_t = i) = (1/2) · q_i(t)   (8)
making the initial state probability distribution be uniform distribution, namely equal probability of each state; and simultaneously, taking the bin corresponding to the maximum probability value of the set z (t) as an observed value of the t-th voice window. Knowing the observed value and the probability model of the HMMpitch, calculating the prediction problem of the HMM according to a Viterbi decoding algorithm, and obtaining the fundamental frequency estimation f of the t-th speech windowpitch(t)。
Step S105: the main melody notes are calculated. This step smooths f_pitch(t) calculated in step S104; the smoothing mainly comprises 3 operations: frequency conversion, HMM decoding and note connection.
(I) Frequency-to-pitch conversion. First, the frequency value of each speech window is converted into a pitch value according to the following formula (9):
N_note(t) = 12 · log2( f_pitch(t) / 440 ) + 69   (9)
(II) HMM decoding. This operation mainly accomplishes the smoothing of N_note(t) to obtain Q_t. An HMM process is again used to model the change of the pitch value Q_t of each speech window; this process is denoted HMM_note. The pitch range is the same as in step S104, containing N_s semitones in total; with a typical pitch range of 33 to 80, N_s = 48. Each semitone is equally divided into N_bps bins, where N_bps is an integer from 2 to 5, typically 3. Each bin can have 3 states, denoted attack (a), stable (b) and silence (c), so N_spb = 3. The number of states of HMM_note is N_note_state = N_s · N_bps · N_spb, with a typical value of N_note_state = 432.
Let the 3 states of the i-th bin be { q_{i,a}, q_{i,b}, q_{i,c} }, i = 1, 2, …, N_s · N_bps. The pitch value Q_t of each speech window has N_note_state possible states, which transition according to the following probability rules (1) to (6):
(1) The probability that the i-th bin transitions from state a to state a is a constant, denoted P(q_{i,a} → q_{i,a}) = P(Q_t = q_{i,a} | Q_{t−1} = q_{i,a}); its value range is 0.85 to 0.95, with a typical value of 0.9;
(2) The probability that state a of the i-th bin transitions to state b is P(q_{i,a} → q_{i,b}) = P(Q_t = q_{i,b} | Q_{t−1} = q_{i,a}) = 1 − P(q_{i,a} → q_{i,a});
(3) The probability that state b of the i-th bin transitions to state b is a constant, P(q_{i,b} → q_{i,b}) = P(Q_t = q_{i,b} | Q_{t−1} = q_{i,b}); its value range is 0.95 to 0.99, with a typical value of 0.99;
(4) The probability that state b of the i-th bin transitions to state c is P(q_{i,b} → q_{i,c}) = P(Q_t = q_{i,c} | Q_{t−1} = q_{i,b}) = 1 − P(q_{i,b} → q_{i,b}), with a typical value of 0.01;
(5) The probability of transitioning from state c of the i-th bin to state a of the j-th bin is
P(q_{i,c} → q_{j,a}) = f_note(Δs_{ij}),
where Δs_{ij} denotes the semitone difference between the two bins i and j, and f_note(Δs) is a Gaussian (normal) probability density with mean 0 and variance σ_note²; the value range of the variance σ_note² is 0.65 to 0.75, with a typical value of 0.7.
(6) The state transition probability between the other states is 0.
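The transition rules (1) to (6) above can be illustrated by the following non-limiting sketch, which assembles the HMM_note transition matrix; the state ordering, the row normalization of the silence-to-attack probabilities and the parameter names are assumptions of this sketch rather than part of the method:

    import numpy as np
    from scipy.stats import norm

    def note_transition_matrix(n_bins=144, n_bps=3, p_aa=0.9, p_bb=0.99, sigma2=0.7):
        # Transition matrix of HMM_note over states ordered as
        # [bin0_a, bin0_b, bin0_c, bin1_a, ...], following rules (1)-(6).
        n_states = 3 * n_bins
        A = np.zeros((n_states, n_states))
        # silence -> attack, Gaussian in the semitone distance between bins (rule (5))
        semitone_dist = np.abs(np.arange(n_bins)[:, None] - np.arange(n_bins)[None, :]) / n_bps
        c_to_a = norm.pdf(semitone_dist, loc=0.0, scale=np.sqrt(sigma2))
        c_to_a /= c_to_a.sum(axis=1, keepdims=True)     # row normalization (an assumption)
        for i in range(n_bins):
            a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
            A[a, a] = p_aa                              # rule (1)
            A[a, b] = 1.0 - p_aa                        # rule (2)
            A[b, b] = p_bb                              # rule (3)
            A[b, c] = 1.0 - p_bb                        # rule (4)
            A[c, 3 * np.arange(n_bins)] = c_to_a[i]     # rule (5)
        return A                                        # all other entries stay 0 (rule (6))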
The pitch observation state V_t is consistent with the hidden state. Given the pitch observation N_note(t) of the t-th speech window, the observation probability is a Gaussian distribution of the pitch difference between the state and the observed value; the observation probability vector of the t-th speech window is calculated according to formula (10), in which the Gaussian densities are evaluated at the pitch difference between the observation and each bin. The value range of the parameter r is 0.05 to 0.2, with a typical value of 0.1; M_{i,a}(t), M_{i,b}(t), M_{i,c}(t) respectively denote the observation probabilities of the 3 different states of the i-th bin; f_{i,a}, f_{i,b}, f_{i,c} are the Gaussian distributions of the 3 different states of the i-th bin; the mean of each Gaussian is usually taken as 0 and the variance parameter is typically 1.0.
The initial state probability distribution of the HMM_note process is made uniform, so that the pitch value Q_t of the t-th speech window can be calculated again according to the decoding process of the HMM.
(III) Connect musical notes. Several consecutive speech windows whose pitch values from operation (II) are equal and within a limited range are connected, forming in time order a note set note(k) = { Q_i | s_k ≤ i ≤ e_k, Q_i = Q_k }; the start time T_{k,start} of the note is the start time of the first speech window of this note set, and the end time T_{k,end} is the end time of the last speech window of this note set; the pitch value, start time and end time form a note triple (Q_k, T_{k,start}, T_{k,end}). Traversing the whole audio file yields all note triples, i.e. the main melody of the whole audio file.
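Operation (III) can be sketched as follows (a minimal illustration; pitch values equal to 0 standing for unvoiced windows and the uniform hop time are assumptions of this sketch):

    def connect_notes(pitches, hop_time):
        # Group consecutive speech windows with equal pitch values Q_i into
        # note triples (Q_k, T_{k,start}, T_{k,end}); hop_time is the window step in seconds.
        notes, start = [], 0
        for i in range(1, len(pitches) + 1):
            if i == len(pitches) or pitches[i] != pitches[start]:
                if pitches[start] > 0:                 # skip unvoiced / silent runs
                    notes.append((pitches[start], start * hop_time, i * hop_time))
                start = i
        return notes

Traversing the whole sequence of smoothed pitch values Q_t in this way yields the note triples that constitute the main melody.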
The invention has the advantages that:
the invention can automatically generate the main melody from the vocal file of the song, and because the cumulative average normalized amplitude difference function and the Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved, and the problem of octave deviation is obviously reduced; by introducing the HMM model, the start-stop time of the main melody is more accurate, and the track curve is smoother.
The invention can automatically generate the main melody from the vocal file of a song; in typical applications such as the automatic scoring system of mobile-phone karaoke software, the scoring accuracy of user works is improved. Meanwhile, the cost of manual production is saved, which brings great economic benefit.
The above embodiments are merely illustrative of the preferred embodiments of the present invention, and not restrictive, and various changes and modifications to the technical solutions of the present invention may be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are intended to fall within the scope of the present invention defined by the appended claims.

Claims (6)

1. The automatic extraction algorithm of the main melody based on the probability model is characterized by comprising the following steps:
s101, calculating a short-time average amplitude difference function;
s102, calculating an accumulated average normalized amplitude difference function;
s103, extracting a plurality of candidate fundamental frequencies and probabilities;
s104, calculating a main fundamental frequency;
and S105, calculating the main melody notes.
2. The probabilistic model-based automatic melody extraction algorithm according to claim 1, wherein the specific process of step S101 is as follows: dividing audio time into frames, and calculating a short-time average amplitude difference function of each frame of time domain signal;
the framed sequence of the input digital audio signal x_t is denoted x_i, i = 1, …, 2W, where subscript i denotes the i-th sample point; a typical frame window length is 2W = 2048, and the step size is 1/4 of the window length; the short-time average amplitude difference function is calculated according to formula (1):
d_t(τ) = Σ_{i=1}^{W} | x_i − x_{i+τ} |   (1)
where t denotes the t-th speech window after framing, and τ is the candidate period, taking integer values between 0 and W.
3. The probabilistic model-based automatic main melody extraction algorithm according to claim 2, wherein the specific process of step S102 is: performing cumulative average normalization on the result of step S101, and traversing all τ according to the following formula to obtain the CMNDF vector of the tth speech window, as shown in the following formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
the first value of the CMNDF vector is 0, and the vector has length W.
4. The probabilistic model-based automatic melody extraction algorithm according to claim 3, wherein the specific process of step S103 is: if a minimum point in the cumulative average normalized amplitude difference function of the speech window is also smaller than a certain threshold value, the minimum point can be regarded as a fundamental frequency or harmonic frequency;
and establishing a Beta probability distribution model of a threshold, and calculating a cumulative distribution function of the candidate fundamental frequencies to calculate a probability value so as to obtain a candidate fundamental frequency probability pair set of each voice window.
5. The probabilistic model-based automatic main melody extraction algorithm according to claim 4, wherein the specific process of step S104 is: dividing the fundamental frequency range of human voice into a plurality of discrete parts (called bins for short), wherein any fundamental frequency value corresponds to a certain fundamental frequency bin region; taking the serial number of bin as a hidden state to track the fundamental frequency;
the fundamental frequency tracking process is modeled as a Hidden Markov process (HMM); establishing a state transition probability matrix between fundamental frequency bins, and calculating an observation probability vector of each voice window through the candidate fundamental frequency probability pair set; and decoding the HMM, and calculating the main fundamental frequency of the whole audio file through a Viterbi algorithm.
6. The probabilistic model-based automatic melody extraction algorithm according to claim 5, wherein the specific process of step S105 is: the step is mainly to smooth the fundamental frequency calculated in the step S104 and establish the HMM process of note pitch value; dividing a pitch range into a plurality of bins, wherein each pitch bin can have three states of starting, stabilizing and muting, and the states are converted according to a Gaussian distribution probability model;
calculating an observation probability vector of the current speech window according to the base frequency value of the known speech window, and obtaining a smoothed pitch value of each speech window according to a Viterbi decoding algorithm of an HMM (hidden Markov model); several time-continuous speech windows of equal pitch are connected to form a complete note, and all speech windows are traversed, so as to obtain the main melody track of the whole audio file.
CN202011545338.7A 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody Pending CN112735365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011545338.7A CN112735365A (en) 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody


Publications (1)

Publication Number Publication Date
CN112735365A true CN112735365A (en) 2021-04-30

Family

ID=75605028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011545338.7A Pending CN112735365A (en) 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody

Country Status (1)

Country Link
CN (1) CN112735365A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504834A (en) * 2009-03-25 2009-08-12 深圳大学 Humming type rhythm identification method based on hidden Markov model
CN106547797A (en) * 2015-09-23 2017-03-29 腾讯科技(深圳)有限公司 Audio frequency generation method and device
CN108595648A (en) * 2018-04-27 2018-09-28 大连民族大学 Music Melody extraction system
CN108735231A (en) * 2018-04-27 2018-11-02 大连民族大学 Theme pitch sequence method of estimation
CN111326171A (en) * 2020-01-19 2020-06-23 成都嗨翻屋科技有限公司 Human voice melody extraction method and system based on numbered musical notation recognition and fundamental frequency extraction


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DE CHEVEIGNÉ, A.: "YIN, a fundamental frequency estimator for speech and music", JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 31 December 2002 (2002-12-31) *
KLAPURI, A.: "Probabilistic Modelling of Note Events in the Transcription of Monophonic Melodies", IEEE, 31 December 2004 (2004-12-31) *
MAUCH, MATTHIAS: "pYIN: a fundamental frequency estimator using probabilistic threshold distributions", ICASSP, 31 December 2014 (2014-12-31), pages 1-5 *
MAUCH, M.: "Computer-aided Melody Note Transcription Using the Tony Software: Accuracy and Efficiency", INTERNATIONAL CONFERENCE ON TECHNOLOGIES FOR MUSIC NOTATION & REPRESENTATION, 31 December 2015 (2015-12-31), pages 1-7 *

Similar Documents

Publication Publication Date Title
Ryynänen et al. Automatic transcription of melody, bass line, and chords in polyphonic music
Benetos et al. Automatic music transcription: challenges and future directions
CN109979488B (en) System for converting human voice into music score based on stress analysis
Raphael Automatic Transcription of Piano Music.
JP5218052B2 (en) Language model generation system, language model generation method, and language model generation program
Cheng et al. Automatic chord recognition for music classification and retrieval
US8494847B2 (en) Weighting factor learning system and audio recognition system
JP2005043666A (en) Voice recognition device
McVicar et al. Leveraging repetition for improved automatic lyric transcription in popular music
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
CN111508505B (en) Speaker recognition method, device, equipment and storage medium
KR20090032972A (en) Method and apparatus for query by singing/huming
JP2010054802A (en) Unit rhythm extraction method from musical acoustic signal, musical piece structure estimation method using this method, and replacing method of percussion instrument pattern in musical acoustic signal
Raczynski et al. Multiple pitch transcription using DBN-based musicological models
Khadkevich et al. A probabilistic approach to simultaneous extraction of beats and downbeats
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
Nishikimi et al. End-to-end melody note transcription based on a beat-synchronous attention mechanism
CN112735365A (en) Probability model-based automatic extraction algorithm for main melody
Takeda et al. Rhythm and tempo analysis toward automatic music transcription
Ikemiya et al. Transcribing vocal expression from polyphonic music
Li et al. Construction and analysis of hidden Markov model for piano notes recognition algorithm
Benetos et al. Improving automatic music transcription through key detection
US20130144612A1 (en) Pitch Period Segmentation of Speech Signals
JP4576612B2 (en) Speech recognition method and speech recognition apparatus
Joder et al. Hidden discrete tempo model: A tempo-aware timing model for audio-to-score alignment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20210430)