CN112735365A - Probability model-based automatic extraction algorithm for main melody - Google Patents

Probability model-based automatic extraction algorithm for main melody

Info

Publication number
CN112735365A
Authority
CN
China
Prior art keywords
fundamental frequency
calculating
probability
value
window
Prior art date
Legal status
Pending
Application number
CN202011545338.7A
Other languages
Chinese (zh)
Inventor
米岚 (Mi Lan)
何晓娇 (He Xiaojiao)
Current Assignee
Chongqing Yue Party Information Technology Co ltd
Original Assignee
Chongqing Yue Party Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Yue Party Information Technology Co ltd filed Critical Chongqing Yue Party Information Technology Co ltd
Priority to CN202011545338.7A
Publication of CN112735365A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/36 Accompaniment arrangements
    • G10H 1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10H 2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/005 Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
    • G10H 2250/015 Markov chains, e.g. hidden Markov models [HMM], for musical processing, e.g. musical analysis or musical composition

Abstract

The invention relates to the field of music information retrieval, in particular to a probabilistic model-based automatic extraction algorithm for the main melody, which comprises the following steps: S101, calculating a short-time average amplitude difference function; S102, calculating a cumulative average normalized amplitude difference function; S103, extracting a plurality of candidate fundamental frequencies and their probabilities; S104, calculating the main fundamental frequency; and S105, calculating the main melody notes. The invention can automatically generate the main melody from the vocal file of a song. Because the cumulative average normalized amplitude difference function and a Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved and the octave deviation problem is significantly reduced; by introducing an HMM, the start and stop times of the main melody are more accurate and the track curve is smoother.

Description

Probability model-based automatic extraction algorithm for main melody
Technical Field
The invention relates to the field of music information retrieval, in particular to a probabilistic model-based automatic extraction algorithm for a main melody.
Background
The main melody refers to the fundamental frequency of the human voice. Extracting the main melody means estimating, from the standard vocal audio of a song, the pitch or fundamental frequency of the monophonic note sequence corresponding to the main melody, and then transcribing it into a text file, a MIDI binary file or the like. As an important issue in the field of music information retrieval, automatic extraction of the main melody plays an important role in music information interaction, such as music recognition, intonation analysis and humming retrieval. A typical application of pitch analysis is the automatic pitch scoring system of mobile-phone karaoke software.
Most existing melody extraction methods estimate pitch salience from the result of a time-frequency transformation and then extract melody lines by applying tracking rules. This type of approach has many problems, such as octave error, in which the pitch F0 is estimated as an integer or fractional multiple of the true value; in addition, some tracking rules are not smooth enough, causing note pitch jumps that lead to pitch errors.
Disclosure of Invention
In order to solve the above problems, the main object of the present invention is to provide an automatic extraction algorithm for a main melody based on a probability model, which can improve the accuracy of identification and tracking of the main melody of a vocal audio file, and specifically comprises: the accuracy of the pitch value is improved, and the octave deviation problem is reduced; the start-stop time of the main melody is more accurate, and the track curve is smoother.
In order to achieve the purpose, the invention adopts the technical scheme that:
the automatic extraction algorithm of the main melody based on the probability model is characterized by comprising the following steps:
s101, calculating a short-time average amplitude difference function;
s102, calculating an accumulated average normalized amplitude difference function;
s103, extracting a plurality of candidate fundamental frequencies and probabilities;
s104, calculating a main fundamental frequency;
and S105, calculating the main melody notes.
Further, the specific process of step S101 is: dividing audio time into frames, and calculating a short-time average amplitude difference function of each frame of time domain signal;
recording of an input digital audio signal xtHas a sub-frame sequence of xiI is 1, …, 2W; subscript i denotes the ith sample point; the typical window length of the frame is 2048 as 2W, and the window length is 1/4; the calculation formula of the short-time average amplitude difference function is as follows (1):
Figure BDA0002855517050000021
where t represents the t-th speech window after framing; τ is the number of possible cycles, and ranges from an integer between 0 and W.
Further, the specific process of step S102 is: performing cumulative average normalization on the result of step S101, and traversing all τ to obtain the CMNDF vector of the t-th speech window according to formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
The first value of the CMNDF vector is 0, and the vector has length W.
Further, the specific process of step S103 is: if a minimum point in the cumulative average normalized amplitude difference function of a speech window is also smaller than a certain threshold value, the minimum point can be regarded as a fundamental frequency or harmonic frequency;
and establishing a Beta probability distribution model of a threshold, and calculating a cumulative distribution function of the candidate fundamental frequencies to calculate a probability value so as to obtain a candidate fundamental frequency probability pair set of each voice window.
Further, the specific process of step S104 is: dividing the fundamental frequency range of human voice into a plurality of discrete parts (called bins for short), wherein any fundamental frequency value corresponds to a certain fundamental frequency bin region; taking the serial number of bin as a hidden state to track the fundamental frequency;
the fundamental frequency tracking process is modeled as a Hidden Markov process (HMM); establishing a state transition probability matrix between fundamental frequency bins, and calculating an observation probability vector of each voice window through the candidate fundamental frequency probability pair set; and decoding the HMM, and calculating the main fundamental frequency of the whole audio file through a Viterbi algorithm.
Further, the specific process of step S105 is: the step is mainly to smooth the fundamental frequency calculated in the step S104 and establish the HMM process of note pitch value; dividing a pitch range into a plurality of bins, wherein each pitch bin can have three states of starting, stabilizing and muting, and the states are converted according to a Gaussian distribution probability model;
calculating an observation probability vector of the current speech window according to the base frequency value of the known speech window, and obtaining a smoothed pitch value of each speech window according to a Viterbi decoding algorithm of an HMM (hidden Markov model); several time-continuous speech windows of equal pitch are connected to form a complete note, and all speech windows are traversed, so as to obtain the main melody track of the whole audio file.
The invention has the beneficial effects that:
the invention can automatically generate the main melody from the vocal file of the song, and because the cumulative average normalized amplitude difference function and the Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved, and the problem of octave deviation is obviously reduced; by introducing the HMM model, the start-stop time of the main melody is more accurate, and the track curve is smoother.
The invention can automatically generate the main melody from the vocal file of a song; in typical applications such as the automatic scoring system of mobile-phone karaoke software, the scoring accuracy of user works is improved. Meanwhile, the cost of manual production is saved, which brings great economic benefit.
Drawings
FIG. 1 is a flowchart illustrating the automatic extraction of the main melody according to the present invention.
Fig. 2 is a schematic diagram of a probability density function image of the threshold S in step S103 of this embodiment.
Detailed Description
The invention relates to a probabilistic model-based automatic extraction algorithm for a main melody (figure 1 is a flow chart for automatically extracting the main melody), which comprises the following steps:
Step S101: the short-time average amplitude difference function (abbreviated AMDF, the same below) is calculated within a window. For a periodic speech signal, the average amplitude difference function is small at positions that are integer multiples of the period. Therefore, the audio is divided into frames along time, the short-time average amplitude difference function of each frame of the time-domain signal is calculated, and possible fundamental frequency positions can then be found from its variation pattern.
Let the framed sequence of the input digital audio signal x_t be x_i, i = 1, …, 2W, where subscript i denotes the i-th sample point; a typical frame window length is 2W = 2048, and the step size is 1/4 of the window length. The short-time average amplitude difference function is calculated according to the following formula (1):
d_t(τ) = Σ_{i=1}^{W} | x_i − x_{i+τ} |   (1)
where t denotes the t-th speech window after framing, and τ is the candidate period, taking integer values between 0 and W.
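By way of a non-limiting illustration, formula (1) can be sketched in Python (NumPy) as follows; the function name amdf, the single-window calling convention and the use of absolute differences as written in formula (1) above are assumptions of this sketch rather than a prescribed implementation:

    import numpy as np

    def amdf(frame, W):
        # Short-time average amplitude difference function d_t(tau), tau = 0..W-1,
        # for one framed window of 2*W samples (cf. formula (1)).
        d = np.empty(W)
        for tau in range(W):
            d[tau] = np.abs(frame[:W] - frame[tau:tau + W]).sum()
        return d

In practice the audio would first be cut into overlapping windows of 2W samples with a step of one quarter of the window length, and amdf would be applied to each window.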
Step S102: the cumulative average normalized amplitude difference function (abbreviated CMNDF, the same below) is calculated. The minimum positions of the short-time average amplitude difference function calculated in step S101 may correspond to the fundamental frequency and its harmonics. In order to find the minimum value more smoothly and to avoid mistaking 0 or an integer multiple of the fundamental frequency for the fundamental frequency, the cumulative average normalization of the short-time average amplitude difference sequence is first calculated and used as the input of the next step. That is, in order to analyze the fundamental frequency candidates more accurately, the result of step S101 is cumulative-average normalized, and the CMNDF vector of the t-th speech window is obtained by traversing all τ according to the following formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
the first value of the CMNDF vector is 0 and has a length W.
Step S103: a plurality of candidate fundamental frequencies and their probabilities are extracted. According to the characteristics of periodic signals, a minimum point of the cumulative average normalized amplitude difference function of a speech window can be regarded as a fundamental frequency or harmonic frequency if it is also smaller than a certain threshold value. A Beta probability distribution model of the threshold is established, and a cumulative distribution function is evaluated for each candidate fundamental frequency to calculate a probability value, thereby obtaining the set of candidate fundamental frequency probability pairs of each speech window. That is, according to the characteristics of the periodic signal, if a minimum point of the CMNDF vector calculated in step S102 is smaller than a threshold value S, it can be regarded as a fundamental frequency or a harmonic frequency.
(a) The empirical value of the threshold S is 0.1. S is treated as a random variable following a Beta distribution; typical parameters are α = 2 and β = 18, i.e. S ~ Beta(2, 18), and the mean of this Beta distribution is 0.1. The range 0 to 1 is divided into N = 100 equal parts, the random variable takes the values s_i, i = 1, 2, …, 100, the probability density function P(s_i) is calculated, and the probability density function image of S is plotted (see FIG. 2).
(b) Find the set of minimum points {τ_0} of the CMNDF vector y_t(τ);
(c) For each minimum point τ_0, the probability that the fundamental period T equals τ_0 is calculated as
p(T = τ_0) = Σ_{i=1}^{N} P(s_i) · [ y_t(τ_0) < s_i ]   (3)
Here [·] denotes the Iverson bracket: the expression equals 1 if the condition inside the bracket is satisfied and 0 otherwise; P(s_i) is the probability density function of the threshold S.
(d) Traverse the set of minimum points {τ_0}, calculate and normalize the probability values, and obtain the set of fundamental-frequency-period probability pairs of the t-th speech window:
z_τ(t) = { (τ_1, p_1), (τ_2, p_2), …, (τ_m, p_m) }   (4)
(e) According to the reciprocal relationship between frequency and period, the fundamental frequency is calculated as f = fs/τ, where fs is the sampling rate of the input audio, with a typical value of 44100 Hz; the set (4) is thus converted into the set of fundamental frequency probability pairs of the t-th speech window, as shown in formula (5):
Z(t) = { (f_1, p_1), (f_2, p_2), …, (f_m, p_m) }   (5)
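As a non-limiting sketch of steps (a) to (e), the candidate fundamental frequencies and probabilities of one speech window can be computed as follows; the function name, the half-sample threshold grid and the discretization of the Beta prior are assumptions of this illustration:

    import numpy as np
    from scipy.stats import beta

    def candidate_f0s(y, fs=44100, n=100, a=2.0, b=18.0):
        # y: CMNDF vector of one window; returns [(f_1, p_1), ..., (f_m, p_m)].
        s = (np.arange(n) + 0.5) / n                    # (a) N = 100 threshold values on (0, 1)
        p_s = beta.pdf(s, a, b) / n                     # discretized Beta(2, 18) prior P(s_i)
        minima = [t for t in range(1, len(y) - 1)       # (b) minimum points of y_t(tau)
                  if y[t] < y[t - 1] and y[t] <= y[t + 1]]
        pairs = []
        for tau in minima:
            p = p_s[s > y[tau]].sum()                   # (c) sum of P(s_i)[y_t(tau) < s_i]
            if p > 0.0:
                pairs.append((fs / tau, p))             # (e) f = fs / tau
        total = sum(p for _, p in pairs)                # (d) normalize
        return [(f, p / total) for f, p in pairs] if total > 0.0 else []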
Step S104: the dominant fundamental frequency is calculated. The fundamental frequency range of the human voice is divided into a plurality of discrete parts (called bins for short), so that any fundamental frequency value corresponds to a certain fundamental frequency bin; taking the sequence number of the bin as a hidden state, the fundamental frequency tracking process is modeled as a hidden Markov process (HMM). A state transition probability matrix between fundamental frequency bins is established, and the observation probability vector of each speech window is calculated from the candidate fundamental frequency probability pair set of step S103; solving the fundamental frequency value of each speech window is thus converted into the decoding process of the HMM, and the main fundamental frequency of the whole audio file can be calculated by the Viterbi algorithm. The method specifically comprises the following steps:
this step implements pitch tracking by calculating the fundamental frequency probability pairs of all the frame-divided speech windows of the audio in the previous step S103. Firstly, limiting the fundamental frequency range of human voice, wherein the typical value is 55Hz to 880Hz, and the pitch range of the human voice corresponds to A1 to A5 and the pitch value is 33 to 80 of a piano keyboard, and the total number of the human voice is 48 semitones; to improve the accuracy of the calculation, a semitone range is divided equally into a number of parts, typically 5 (e.g. A1 to # A1 divided into A1.1,A1.2,A1.3,A1.3,A1.5) The specific procedure of the algorithm is described below in terms of 5 equal divisions. The whole fundamental frequency corresponds to 240 pitch statistics stacks (abbreviated as bin) with the total of M5 × 48 for the final pitch, the fundamental frequency variation process is modeled as an HMM process, denoted as HMMpitch: for the tth speech window, the pitch variable ht240 × 2 ═ 480 states of the following formula (6):
h_t ~ { A1.1, A1.2, …, #G5.5, −A1.1, −A1.2, …, −#G5.5 }   (6)
Pitches with a negative sign indicate the case where the corresponding positive pitch is unvoiced; e.g. −A1.1 corresponds to A1.1. The value range of the observed variable O_t is the same as that of h_t, as in formula (7):
O_t ~ { A1.1, A1.2, …, #G5.5, −A1.1, −A1.2, …, −#G5.5 }   (7)
modeling the state transition probability vector of each hidden state into triangular probability distribution (triangular distribution) with the current state as the maximum value, wherein the semitone number deviation of two adjacent voice windows does not exceed a certain distance; typical values for the spacing on both sides are 2.5 semitones, corresponding to 5 bins.
In the set of fundamental frequency probability pairs calculated in step S103, for a given fundamental frequency probability pair (f_j, p_j), the nearest bin is found and numbered m, m ∈ {1, 2, …, M}; this bin is assigned the probability
q_m(t) = p_j,
while the other bins are assigned
q_i(t) = 0, i ≠ m.
Voicing and non-voicing of each speech window are treated as equally probable events, and the observation probability corresponding to the i-th bin of the t-th speech window is then
P(O_t | h_t = i) = (1/2) · q_i(t)   (8)
making the initial state probability distribution be uniform distribution, namely equal probability of each state; and simultaneously, taking the bin corresponding to the maximum probability value of the set z (t) as an observed value of the t-th voice window. Knowing the observed value and the probability model of the HMMpitch, calculating the prediction problem of the HMM according to a Viterbi decoding algorithm, and obtaining the fundamental frequency estimation f of the t-th speech windowpitch(t)。
Step S105: the main melody notes are calculated. This step smooths f_pitch(t) calculated in step S104; the smoothing mainly comprises 3 operations: frequency conversion, HMM decoding and note connection.
(I) Frequency-to-pitch conversion. First, the frequency value of each speech window is converted into a pitch value according to the following formula (9):
N_note(t) = 12 · log2( f_pitch(t) / 440 ) + 69   (9)
(II) HMM decoding. This operation mainly accomplishes the smoothing of N_note(t) to obtain Q_t. An HMM process is again used to model the change of the pitch value Q_t of each speech window; this process is denoted HMM_note. The pitch range is the same as in step S104, containing N_s semitones in total; with a typical pitch range of 33 to 80, N_s = 48. Each semitone is equally divided into N_bps bins, where N_bps is an integer from 2 to 5, typically 3. Each bin can have 3 states, denoted attack (a), stable (b) and silence (c), so N_spb = 3. The number of states of HMM_note is N_note_state = N_s · N_bps · N_spb, with a typical value of N_note_state = 432.
Let the 3 states of the i-th bin be { q_{i,a}, q_{i,b}, q_{i,c} }, i = 1, 2, …, N_s · N_bps. The pitch value Q_t of each speech window has N_note_state possible states, which transition according to the following probability rules (1) to (6):
(1) The probability that the i-th bin transitions from state a to state a is a constant, denoted P(q_{i,a} → q_{i,a}) = P(Q_t = q_{i,a} | Q_{t−1} = q_{i,a}); its value range is 0.85 to 0.95, with a typical value of 0.9;
(2) The probability that state a of the i-th bin transitions to state b is P(q_{i,a} → q_{i,b}) = P(Q_t = q_{i,b} | Q_{t−1} = q_{i,a}) = 1 − P(q_{i,a} → q_{i,a});
(3) The probability that state b of the i-th bin transitions to state b is a constant, P(q_{i,b} → q_{i,b}) = P(Q_t = q_{i,b} | Q_{t−1} = q_{i,b}); its value range is 0.95 to 0.99, with a typical value of 0.99;
(4) The probability that state b of the i-th bin transitions to state c is P(q_{i,b} → q_{i,c}) = P(Q_t = q_{i,c} | Q_{t−1} = q_{i,b}) = 1 − P(q_{i,b} → q_{i,b}), with a typical value of 0.01;
(5) The probability of transitioning from state c of the i-th bin to state a of the j-th bin is
P(q_{i,c} → q_{j,a}) = f_note(Δs_{ij}),
where Δs_{ij} denotes the semitone difference between the two bins i and j, and f_note(Δs) is a Gaussian (normal) probability density with mean 0 and variance σ_note²; the value range of the variance σ_note² is 0.65 to 0.75, with a typical value of 0.7.
(6) The state transition probability between the other states is 0.
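The transition rules (1) to (6) above can be illustrated by the following non-limiting sketch, which assembles the HMM_note transition matrix; the state ordering, the row normalization of the silence-to-attack probabilities and the parameter names are assumptions of this sketch rather than part of the method:

    import numpy as np
    from scipy.stats import norm

    def note_transition_matrix(n_bins=144, n_bps=3, p_aa=0.9, p_bb=0.99, sigma2=0.7):
        # Transition matrix of HMM_note over states ordered as
        # [bin0_a, bin0_b, bin0_c, bin1_a, ...], following rules (1)-(6).
        n_states = 3 * n_bins
        A = np.zeros((n_states, n_states))
        # silence -> attack, Gaussian in the semitone distance between bins (rule (5))
        semitone_dist = np.abs(np.arange(n_bins)[:, None] - np.arange(n_bins)[None, :]) / n_bps
        c_to_a = norm.pdf(semitone_dist, loc=0.0, scale=np.sqrt(sigma2))
        c_to_a /= c_to_a.sum(axis=1, keepdims=True)     # row normalization (an assumption)
        for i in range(n_bins):
            a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
            A[a, a] = p_aa                              # rule (1)
            A[a, b] = 1.0 - p_aa                        # rule (2)
            A[b, b] = p_bb                              # rule (3)
            A[b, c] = 1.0 - p_bb                        # rule (4)
            A[c, 3 * np.arange(n_bins)] = c_to_a[i]     # rule (5)
        return A                                        # all other entries stay 0 (rule (6))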
The pitch observation state V_t is consistent with the hidden state. Given the pitch observation N_note(t) of the t-th speech window, the observation probability is a Gaussian distribution of the pitch difference between the state and the observed value; the observation probability vector of the t-th speech window is calculated according to formula (10), in which the Gaussian densities are evaluated at the pitch difference between the observation and each bin. The value range of the parameter r is 0.05 to 0.2, with a typical value of 0.1; M_{i,a}(t), M_{i,b}(t), M_{i,c}(t) respectively denote the observation probabilities of the 3 different states of the i-th bin; f_{i,a}, f_{i,b}, f_{i,c} are the Gaussian distributions of the 3 different states of the i-th bin; the mean of each Gaussian is usually taken as 0 and the variance parameter is typically 1.0.
The initial state probability distribution of the HMM_note process is made uniform, so that the pitch value Q_t of the t-th speech window can be calculated again according to the decoding process of the HMM.
(III) Connect musical notes. Several consecutive speech windows whose pitch values from operation (II) are equal and within a limited range are connected, forming in time order a note set note(k) = { Q_i | s_k ≤ i ≤ e_k, Q_i = Q_k }; the start time T_{k,start} of the note is the start time of the first speech window of this note set, and the end time T_{k,end} is the end time of the last speech window of this note set; the pitch value, start time and end time form a note triple (Q_k, T_{k,start}, T_{k,end}). Traversing the whole audio file yields all note triples, i.e. the main melody of the whole audio file.
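Operation (III) can be sketched as follows (a minimal illustration; pitch values equal to 0 standing for unvoiced windows and the uniform hop time are assumptions of this sketch):

    def connect_notes(pitches, hop_time):
        # Group consecutive speech windows with equal pitch values Q_i into
        # note triples (Q_k, T_{k,start}, T_{k,end}); hop_time is the window step in seconds.
        notes, start = [], 0
        for i in range(1, len(pitches) + 1):
            if i == len(pitches) or pitches[i] != pitches[start]:
                if pitches[start] > 0:                 # skip unvoiced / silent runs
                    notes.append((pitches[start], start * hop_time, i * hop_time))
                start = i
        return notes

Traversing the whole sequence of smoothed pitch values Q_t in this way yields the note triples that constitute the main melody.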
The invention has the advantages that:
the invention can automatically generate the main melody from the vocal file of the song, and because the cumulative average normalized amplitude difference function and the Beta distribution model of the threshold are applied, the accuracy of the pitch value is improved, and the problem of octave deviation is obviously reduced; by introducing the HMM model, the start-stop time of the main melody is more accurate, and the track curve is smoother.
The invention can automatically generate the main melody from the vocal file of a song; in typical applications such as the automatic scoring system of mobile-phone karaoke software, the scoring accuracy of user works is improved. Meanwhile, the cost of manual production is saved, which brings great economic benefit.
The above embodiments are merely illustrative of the preferred embodiments of the present invention, and not restrictive, and various changes and modifications to the technical solutions of the present invention may be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are intended to fall within the scope of the present invention defined by the appended claims.

Claims (6)

1. The automatic extraction algorithm of the main melody based on the probability model is characterized by comprising the following steps:
s101, calculating a short-time average amplitude difference function;
s102, calculating an accumulated average normalized amplitude difference function;
s103, extracting a plurality of candidate fundamental frequencies and probabilities;
s104, calculating a main fundamental frequency;
and S105, calculating the main melody notes.
2. The probabilistic model-based automatic melody extraction algorithm according to claim 1, wherein the specific process of step S101 is as follows: dividing audio time into frames, and calculating a short-time average amplitude difference function of each frame of time domain signal;
the framed sequence of the input digital audio signal x_t is denoted x_i, i = 1, …, 2W, where subscript i denotes the i-th sample point; a typical frame window length is 2W = 2048, and the step size is 1/4 of the window length; the short-time average amplitude difference function is calculated according to formula (1):
d_t(τ) = Σ_{i=1}^{W} | x_i − x_{i+τ} |   (1)
where t denotes the t-th speech window after framing, and τ is the candidate period, taking integer values between 0 and W.
3. The probabilistic model-based automatic main melody extraction algorithm according to claim 2, wherein the specific process of step S102 is: performing cumulative average normalization on the result of step S101, and traversing all τ according to the following formula to obtain the CMNDF vector of the tth speech window, as shown in the following formula (2):
y_t(τ) = d_t(τ) / [ (1/τ) · Σ_{j=1}^{τ} d_t(j) ],  τ = 1, …, W−1,  with y_t(0) = 0   (2)
the first value of the CMNDF vector is 0, and the vector has length W.
4. The probabilistic model-based automatic melody extraction algorithm according to claim 3, wherein the specific process of step S103 is: if a minimum point in the cumulative average normalized amplitude difference function of the speech window is also smaller than a certain threshold value, the minimum point can be regarded as a fundamental frequency or harmonic frequency;
and establishing a Beta probability distribution model of a threshold, and calculating a cumulative distribution function of the candidate fundamental frequencies to calculate a probability value so as to obtain a candidate fundamental frequency probability pair set of each voice window.
5. The probabilistic model-based automatic main melody extraction algorithm according to claim 4, wherein the specific process of step S104 is: dividing the fundamental frequency range of human voice into a plurality of discrete parts (called bins for short), wherein any fundamental frequency value corresponds to a certain fundamental frequency bin region; taking the serial number of bin as a hidden state to track the fundamental frequency;
the fundamental frequency tracking process is modeled as a Hidden Markov process (HMM); establishing a state transition probability matrix between fundamental frequency bins, and calculating an observation probability vector of each voice window through the candidate fundamental frequency probability pair set; and decoding the HMM, and calculating the main fundamental frequency of the whole audio file through a Viterbi algorithm.
6. The probabilistic model-based automatic melody extraction algorithm according to claim 5, wherein the specific process of step S105 is: the step is mainly to smooth the fundamental frequency calculated in the step S104 and establish the HMM process of note pitch value; dividing a pitch range into a plurality of bins, wherein each pitch bin can have three states of starting, stabilizing and muting, and the states are converted according to a Gaussian distribution probability model;
calculating an observation probability vector of the current speech window according to the base frequency value of the known speech window, and obtaining a smoothed pitch value of each speech window according to a Viterbi decoding algorithm of an HMM (hidden Markov model); several time-continuous speech windows of equal pitch are connected to form a complete note, and all speech windows are traversed, so as to obtain the main melody track of the whole audio file.
CN202011545338.7A 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody Pending CN112735365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011545338.7A CN112735365A (en) 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody


Publications (1)

Publication Number Publication Date
CN112735365A true CN112735365A (en) 2021-04-30

Family

ID=75605028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011545338.7A Pending CN112735365A (en) 2020-12-24 2020-12-24 Probability model-based automatic extraction algorithm for main melody

Country Status (1)

Country Link
CN (1) CN112735365A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504834A (en) * 2009-03-25 2009-08-12 深圳大学 Humming type rhythm identification method based on hidden Markov model
CN106547797A (en) * 2015-09-23 2017-03-29 腾讯科技(深圳)有限公司 Audio frequency generation method and device
CN108595648A (en) * 2018-04-27 2018-09-28 大连民族大学 Music Melody extraction system
CN108735231A (en) * 2018-04-27 2018-11-02 大连民族大学 Theme pitch sequence method of estimation
CN111326171A (en) * 2020-01-19 2020-06-23 成都嗨翻屋科技有限公司 Human voice melody extraction method and system based on numbered musical notation recognition and fundamental frequency extraction


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DE CHEVEIGNÉ, A.: "YIN, a fundamental frequency estimator for speech and music", JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 31 December 2002 (2002-12-31) *
KLAPURI, A.: "Probabilistic Modelling of Note Events in the Transcription of Monophonic Melodies", IEEE, 31 December 2004 (2004-12-31) *
MAUCH, MATTHIAS: "pYIN: a fundamental frequency estimator using probabilistic threshold distributions", ICASSP, 31 December 2014 (2014-12-31), pages 1-5 *
MAUCH, M.: "Computer-aided Melody Note Transcription Using the Tony Software: Accuracy and Efficiency", INTERNATIONAL CONFERENCE ON TECHNOLOGIES FOR MUSIC NOTATION & REPRESENTATION, 31 December 2015 (2015-12-31), pages 1-7 *

Similar Documents

Publication Publication Date Title
Ryynänen et al. Automatic transcription of melody, bass line, and chords in polyphonic music
Benetos et al. Automatic music transcription: challenges and future directions
CN109979488B (en) System for converting human voice into music score based on stress analysis
Raphael Automatic Transcription of Piano Music.
JP5218052B2 (en) Language model generation system, language model generation method, and language model generation program
Cheng et al. Automatic chord recognition for music classification and retrieval
US8494847B2 (en) Weighting factor learning system and audio recognition system
JP2005043666A (en) Voice recognition device
McVicar et al. Leveraging repetition for improved automatic lyric transcription in popular music
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
CN111508505B (en) Speaker recognition method, device, equipment and storage medium
KR20090032972A (en) Method and apparatus for query by singing/huming
JP2010054802A (en) Unit rhythm extraction method from musical acoustic signal, musical piece structure estimation method using this method, and replacing method of percussion instrument pattern in musical acoustic signal
Raczynski et al. Multiple pitch transcription using DBN-based musicological models
Khadkevich et al. A probabilistic approach to simultaneous extraction of beats and downbeats
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
Nishikimi et al. End-to-end melody note transcription based on a beat-synchronous attention mechanism
CN112735365A (en) Probability model-based automatic extraction algorithm for main melody
Takeda et al. Rhythm and tempo analysis toward automatic music transcription
Ikemiya et al. Transcribing vocal expression from polyphonic music
Li et al. Construction and analysis of hidden Markov model for piano notes recognition algorithm
Benetos et al. Improving automatic music transcription through key detection
US20130144612A1 (en) Pitch Period Segmentation of Speech Signals
JP4576612B2 (en) Speech recognition method and speech recognition apparatus
Joder et al. Hidden discrete tempo model: A tempo-aware timing model for audio-to-score alignment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20210430)