Summary of the invention
The technical problem addressed by the embodiments of the invention is to provide a music notation method and device based on humming input that can recognize a hummed input quickly and with high precision and convert it into a musical score.
To solve the above problem, an embodiment of the invention provides a music notation method based on humming input, comprising:
Framing the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames;
Performing silence detection on each speech frame to obtain the energy of each speech frame;
Filtering each speech frame and applying three-level clipping to the filtered signal to obtain the clipped signal corresponding to each speech frame;
Performing a standard autocorrelation calculation on the clipped signal of each speech frame and extracting the pitch of each speech frame from the result;
Performing note cutting on the voice signal according to the energy and pitch of each speech frame to obtain a cutting result, the cutting result comprising notes together with a note intensity and a note duration in one-to-one correspondence with the notes;
Obtaining and outputting a musical score based on the instrument type set by the user and the cutting result.
Wherein the step of framing the received voice signal according to the preset window function to obtain a plurality of corresponding speech frames specifically comprises:
Framing the received voice signal according to the preset window function to obtain a plurality of corresponding speech frames;
Applying a frame shift to each speech frame according to a preset frame shift length, so that adjacent speech frames overlap.
Wherein the step of performing silence detection on each speech frame to obtain the energy of each speech frame specifically comprises:
Calculating and saving the energy of each speech frame;
Comparing the energy of each speech frame with a preset threshold; if the energy of the current speech frame is greater than the threshold, proceeding to the filtering step; otherwise directly setting the pitch of the current speech frame to 0.
Wherein the step of performing the standard autocorrelation calculation on the clipped signal of each speech frame and extracting the pitch of each speech frame from the result specifically comprises:
A: applying the standard autocorrelation function to the clipped signal of each speech frame to obtain the autocorrelation function of each speech frame, and selecting, in each speech frame, the autocorrelation function values that correspond to the specific sampling rate;
B: dividing the selected autocorrelation function of the current speech frame into equal segments according to a preset number of segments;
C: within each segment of the autocorrelation function of the current speech frame, comparing the autocorrelation function values pairwise in turn and taking the largest value in the segment as its peak, thereby obtaining the autocorrelation peak sequence of the current speech frame;
D: recording the position corresponding to each peak in the autocorrelation peak sequence of the current speech frame;
E: traversing the autocorrelation peak sequence of the current speech frame to determine the pitch position, and substituting the specific sampling rate and the pitch position into the pitch formula to obtain the pitch of the current speech frame; repeating steps A to E until the pitch of every speech frame has been obtained.
Wherein the step of performing note cutting on the voice signal according to the energy and pitch of each speech frame to obtain the cutting result specifically comprises:
Comparing the energies of the speech frames in turn, and taking as a peak the energy of any current speech frame that is greater than the energies of both the two preceding and the two following speech frames, until the last speech frame has been compared, thereby obtaining the energy peak sequence of the whole voice signal;
Calculating a note from the pitch corresponding to each peak in the energy peak sequence;
Calculating, from each peak in the energy peak sequence, the note intensity in one-to-one correspondence with the note;
Obtaining, from the time interval between every two adjacent peaks in the energy peak sequence, the note duration in one-to-one correspondence with the note.
Correspondingly, an embodiment of the invention also provides a music notation device based on humming input, namely a melody synthesizer based on humming input, comprising a framing module, a detection module, a processing module, a computing module, a cutting module and an output module, wherein:
The framing module is configured to frame the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames;
The detection module is configured to perform silence detection on each speech frame obtained by the framing module to obtain the energy of each speech frame;
The processing module is configured to filter each speech frame obtained by the framing module and to apply three-level clipping to the filtered signal to obtain the clipped signal corresponding to each speech frame;
The computing module is configured to perform a standard autocorrelation calculation on the clipped signal of each speech frame obtained by the processing module and to extract the pitch of each speech frame from the result;
The cutting module is configured to perform note cutting on the voice signal according to the energy of each speech frame obtained by the detection module and the pitch of each speech frame obtained by the computing module, to obtain a cutting result comprising notes together with a note intensity and a note duration in one-to-one correspondence with the notes;
The output module is configured to obtain and output a musical score according to the instrument type set by the user and the cutting result obtained by the cutting module.
Wherein the framing module comprises a windowing unit and a frame shift unit, wherein:
The windowing unit is configured to frame the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames;
The frame shift unit is configured to apply a frame shift, according to a preset frame shift length, to each speech frame obtained by the windowing unit, so that adjacent speech frames overlap.
Wherein the detection module comprises a calculation and storage unit, a comparing unit and a setting unit, wherein:
The calculation and storage unit is configured to calculate and save the energy of each speech frame obtained by the framing module;
The comparing unit is configured to compare the energy of each speech frame obtained by the calculation and storage unit with a preset threshold;
The setting unit is configured to set to 0 the pitch of any speech frame whose energy is less than the preset threshold.
Wherein the computing module comprises a sampling unit, a segmenting unit, a first contrast unit, a recording unit and a determining unit, wherein:
The sampling unit is configured to apply the standard autocorrelation function to the clipped signal of each speech frame obtained by the processing module to obtain the autocorrelation function of each speech frame, and to select, in each speech frame, the autocorrelation function values that correspond to the specific sampling rate;
The segmenting unit is configured to divide the selected autocorrelation function of the current speech frame into equal segments according to a preset number of segments;
The first contrast unit is configured to compare pairwise, in turn, the autocorrelation function values within each segment obtained by the segmenting unit for the current speech frame, and to take the largest value in each segment as a peak, thereby obtaining the autocorrelation peak sequence of the current speech frame;
The recording unit is configured to record the position corresponding to each peak in the autocorrelation peak sequence of the current speech frame obtained by the first contrast unit;
The determining unit is configured to traverse the autocorrelation peak sequence of the current speech frame obtained by the first contrast unit to determine the pitch position, and to substitute the specific sampling rate and the pitch position into the pitch formula to obtain the pitch of the current speech frame.
Wherein the cutting module comprises a second contrast unit, a note unit, a note intensity unit and a note duration unit, wherein:
The second contrast unit is configured to compare in turn the energies of the speech frames obtained by the calculation and storage unit, and to take as a peak the energy of any current speech frame that is greater than the energies of both the two preceding and the two following speech frames, until the last speech frame has been compared, thereby obtaining the energy peak sequence of the whole voice signal;
The note unit is configured to calculate a note from the pitch corresponding to each peak in the energy peak sequence obtained by the second contrast unit;
The note intensity unit is configured to calculate, from each peak in the energy peak sequence obtained by the second contrast unit, the note intensity in one-to-one correspondence with the note;
The note duration unit is configured to obtain, from the time interval between every two adjacent peaks in the energy peak sequence obtained by the second contrast unit, the note duration in one-to-one correspondence with the note.
The embodiments of the invention have the following beneficial effects:
By successively applying framing, frame shifting, silence detection, filtering, three-level clipping, standard autocorrelation calculation, peak selection and median filtering to the received voice signal, the embodiments of the invention obtain the energy and pitch of each speech frame; by further cutting and recognizing the energy and pitch, they obtain the information needed to produce a melody, such as notes, note intensity, note duration, beat and key signature. The processing is fast, accurate and flexible. Combined with the instrument type set by the user, both a MIDI (Musical Instrument Digital Interface) file and numbered musical notation and staff notation can be obtained.
In particular, three-level clipping not only eliminates the influence of formants in the voice signal on pitch extraction but also reduces the computation needed for the autocorrelation function.
Embodiment
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, which is a structural diagram of a first embodiment of the melody synthesizer based on humming input according to the invention, the melody synthesizer based on humming input comprises a framing module 10, a detection module 20, a processing module 30, a computing module 40, a cutting module 50 and an output module 60, wherein:
The framing module 10 is configured to frame the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames.
Specifically, the user can hum freely, at any time and in any place, to the melody synthesizer based on humming input provided by the embodiment of the invention; the framing module 10 receives the voice signal of the user's humming and frames it using a window function.
Specifically, the framing module 10 divides the voice signal into multiple frames by applying a window function of length N to it. The resulting speech frames can be expressed by formula 2-1:
$x_n(m) = w(m)\,x(n+m), \quad 0 \le m \le N-1$ (formula 2-1)
Wherein $x_n(m)$ denotes the m-th sample of frame n, N is the frame length, and w(m) is the window function. In the present embodiment a rectangular window is used, and the window function is given by formula 2-2:
$w(m) = \begin{cases} 1, & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}$ (formula 2-2)
The detection module 20 is configured to perform silence detection on each speech frame obtained by the framing module 10 to obtain the energy of each speech frame.
Specifically, to locate the pitch more accurately in the subsequent extraction of the pitch frame sequence, silence detection (VAD, Voice Activity Detection) is required, and the pitch of silent segments is set to 0. Likewise, because pure-noise segments have very low volume and no fixed pitch, their pitch is also set to 0.
In addition, the energy of each speech frame calculated by the detection module 20 during silence detection needs to be saved for the subsequent note cutting.
The processing module 30 is configured to filter each speech frame obtained by the framing module 10 and to apply three-level clipping to the filtered signal to obtain the clipped signal corresponding to each speech frame.
Specifically, to eliminate the influence of electromagnetic noise, environmental noise and formants on the subsequent pitch calculation, the processing module 30 applies low-pass filtering to each speech frame.
Since the pitch of the human voice generally lies in the range (50, 500) Hz, a low-pass filter with cutoff frequency wc = 500 Hz can be applied directly. Considering further that humming differs from everyday speech, the cutoff frequency wc is here set to any value in [500, 1000] Hz.
Specifically, the low-pass filter may be an IIR (Infinite Impulse Response) filter or an FIR (Finite Impulse Response) filter. An IIR filter has low delay and low computational cost and can meet real-time requirements, but its stability must be considered. An FIR filter is simple to design; if a lower-order FIR filter is used, the delay can be reduced while good phase characteristics are retained. Taking the FIR filter as an example in the present embodiment, the impulse response function of the low-pass filter is given by formula 2-3:
(formula 2-3)
Wherein N is the filter length, N-1 is the filter order, and M = (N-1)/2.
Similarly, a band-pass filter with lower and upper cutoff frequencies of 50 Hz and 500 Hz respectively may be applied to each speech frame instead, which is not described further here.
Next, the processing module 30 convolves each speech frame with the filter impulse response according to formula 2-4 to obtain the low-pass-filtered signal, where * denotes convolution.
$y_m(n) = h(n) * x_m(n)$ (formula 2-4)
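The filtering step can be sketched as follows. Formula 2-3 is not reproduced in this text, so the sketch assumes the textbook truncated ideal low-pass response h(n) = sin(wc(n - M)) / (π(n - M)) with M = (N - 1)/2; the function names `lowpass_fir` and `convolve` and the example parameters are illustrative and are not taken from the embodiment.

```python
import math

def lowpass_fir(num_taps, cutoff_hz, fs):
    """Truncated ideal low-pass impulse response (assumed form of formula 2-3)."""
    m_center = (num_taps - 1) / 2           # M = (N - 1) / 2
    wc = 2 * math.pi * cutoff_hz / fs       # cutoff in radians per sample
    taps = []
    for n in range(num_taps):
        d = n - m_center
        # The limit of sin(wc*d)/(pi*d) as d -> 0 is wc/pi.
        taps.append(wc / math.pi if d == 0 else math.sin(wc * d) / (math.pi * d))
    return taps

def convolve(x, h):
    """Direct-form linear convolution y = h * x (formula 2-4)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y
```

For example, `convolve(frame, lowpass_fir(31, 800.0, 16000.0))` would filter one frame with a 31-tap filter whose cutoff lies inside the [500, 1000] Hz range mentioned above. A practical design would usually also apply a tapering window to the taps; that refinement is omitted here.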
To further eliminate the influence of formants in the voice on pitch extraction, and to reduce the computation required for the subsequent autocorrelation function, the processing module 30 in the embodiment of the invention then applies three-level clipping to each low-pass-filtered speech frame.
The three-level clipping function is given by formula 2-5:
(formula 2-5)
Wherein CL = r*max[x(i)], 0 < i < N-1; the value of r is set empirically within [0.6, 0.8], typically r = 0.68; and max denotes the maximum operation.
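A minimal sketch of three-level clipping is given below. Formula 2-5 is not reproduced in this text, so the sketch assumes the conventional definition: samples above +CL map to 1, samples below -CL map to -1, and everything in between maps to 0. Taking the maximum over absolute values is also an assumption; the name `three_level_clip` is illustrative.

```python
def three_level_clip(frame, r=0.68):
    """Map each sample to -1, 0 or +1 using the clipping level CL = r * max|x(i)|."""
    peak = max(abs(s) for s in frame)   # assumption: peak taken over absolute values
    cl = r * peak
    return [1 if s > cl else -1 if s < -cl else 0 for s in frame]
```

Because the output contains only -1, 0 and 1, the products in the subsequent autocorrelation reduce to sign comparisons, which is the source of the computational saving described above.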
The computing module 40 is configured to perform a standard autocorrelation calculation on the clipped signal of each speech frame obtained by the processing module 30 and to extract the pitch of each speech frame from the result.
Specifically, the computing module 40 uses the autocorrelation function of formula 2-6 to perform the standard autocorrelation calculation on the clipped signal of each speech frame obtained by the processing module 30.
(formula 2-6)
Wherein the autocorrelation function $R_n(k)$ is an even function, $x_n(m)$ is the signal of the n-th speech frame after three-level clipping, and N is the frame length.
For $R_n(k)$, the pitch position can be located by a peak-search strategy, yielding the pitch of each speech frame.
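The autocorrelation step can be illustrated as follows. Formula 2-6 is not reproduced in this text, so the sketch assumes the usual short-time definition, the sum of x_n(m)·x_n(m+k) for m from 0 to N-1-k; the name `autocorrelation` is illustrative.

```python
def autocorrelation(frame, k):
    """Short-time autocorrelation R_n(k) of one frame (assumed form of formula 2-6)."""
    k = abs(k)                      # R_n(k) is even, so R_n(-k) = R_n(k)
    n = len(frame)
    return sum(frame[m] * frame[m + k] for m in range(n - k))
```

For a periodic frame, R_n(k) has local maxima at lags equal to multiples of the period, which is why a peak search over k can locate the pitch position.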
The cutting module 50 is configured to perform note cutting on the voice signal according to the energy of each speech frame obtained by the detection module 20 and the pitch of each speech frame obtained by the computing module 40, to obtain a cutting result comprising notes together with a note intensity and a note duration in one-to-one correspondence with the notes.
Specifically, the cutting module 50 performs note cutting on the energy and pitch of each speech frame to obtain notes, note intensities and note durations, the note intensities and note durations being in one-to-one correspondence with the notes.
The cutting module 50 can also obtain the beat of each speech frame.
The key signature information of each speech frame can be set to a default value, such as the key of C.
The output module 60 is configured to obtain and output a musical score according to the instrument type set by the user and the cutting result obtained by the cutting module 50.
Specifically, the user can select any type of instrument through the external interface of the melody synthesizer based on humming input; the output module 60 then uses frequency modulation synthesis to synthesize a melody in MIDI format from the instrument type set by the user and the cutting result obtained by the cutting module 50 (notes, note intensities and note durations), and outputs it.
In addition, numbered musical notation and staff notation can be obtained by combining the beat and key signature information of each speech frame.
By successively applying framing, silence detection, filtering, three-level clipping, standard autocorrelation calculation and peak selection to the received voice signal, the embodiment of the invention obtains the energy and pitch of each speech frame; by further cutting and recognizing the energy and pitch, it obtains the information needed to produce a melody, such as notes, note intensity, note duration, beat and key signature. The processing is fast, accurate and flexible. Combined with the instrument type set by the user, both a MIDI file and numbered musical notation and staff notation can be obtained.
In particular, three-level clipping not only eliminates the influence of formants in the voice signal on pitch extraction but also reduces the computation needed for the autocorrelation function.
Referring to Fig. 2, which is a structural diagram of a second embodiment of the melody synthesizer based on humming input according to the invention, the melody synthesizer based on humming input comprises the framing module 10, detection module 20, processing module 30, computing module 40, cutting module 50 and output module 60 of the first embodiment described above; in the present embodiment it further comprises a removal module 70.
The removal module 70 is configured to apply median filtering to the pitch frame sequence formed by the pitches obtained by the computing module 40 in order to remove outliers, an outlier being a pitch in the pitch frame sequence that clearly deviates from the overall trend.
Specifically, to improve computational accuracy, the removal module 70 further applies median filtering to the pitch frame sequence formed by the pitches obtained by the computing module 40, thereby removing the outliers.
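The outlier removal can be sketched with a sliding median. The text does not specify the window width or how the ends of the sequence are handled; the width of 3 and the shortened windows at the edges used below are assumptions, and the name `median_filter` is illustrative.

```python
from statistics import median

def median_filter(pitches, width=3):
    """Replace each pitch by the median of a window centred on it."""
    half = width // 2
    out = []
    for i in range(len(pitches)):
        lo = max(0, i - half)                    # window is shortened at the edges
        hi = min(len(pitches), i + half + 1)
        out.append(median(pitches[lo:hi]))
    return out
```

With the input `[100, 101, 400, 102, 103]`, the isolated value 400 is replaced by 102, while the surrounding trend is preserved.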
Referring to Fig. 3, which is a structural diagram of an embodiment of the framing module 10 in Fig. 2, the framing module 10 specifically comprises:
A windowing unit 101, configured to frame the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames.
A frame shift unit 102, configured to apply a frame shift, according to a preset frame shift length, to each speech frame obtained by the windowing unit 101, so that adjacent speech frames overlap.
Specifically, to make the analysis result smoother, to reduce the spectral truncation and spectral leakage caused by windowing, and to avoid the Gibbs phenomenon during synthesis, adjacent frames need to overlap. Let the overlap length be O and the frame length be N; the frame shift length is then S = N - O.
The frame shift unit 102 shifts each speech frame obtained by the windowing unit 101 by a length of S = N - O, so that adjacent speech frames overlap by a length of O. In the subsequent calculations, the part of each frame that overlaps the following frame is therefore processed twice. Although this increases the amount of computation, it is better suited to pitch estimation and gives higher precision and smaller error. It should be noted that each autocorrelation value depends on all the other values in the sequence and is meaningful only within that sequence. The autocorrelation value obtained at the same sample position therefore differs from one sequence to another; that is, the autocorrelation values computed for the overlapping part differ between a frame and the frame that follows it.
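The framing and frame shift described above can be sketched as follows, assuming a rectangular window (so each frame is simply a slice of the signal) and assuming that a trailing partial frame is discarded; the name `split_frames` is illustrative.

```python
def split_frames(signal, frame_len, overlap):
    """Split a signal into frames of length N that overlap by O samples."""
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap          # frame shift S = N - O
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop
    return frames
```

For instance, a 10-sample signal with N = 4 and O = 2 gives four frames starting at samples 0, 2, 4 and 6, each sharing its last two samples with the next frame.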
Referring to Fig. 4, which is a structural diagram of an embodiment of the detection module 20 in Fig. 2, the detection module 20 specifically comprises:
A calculation and storage unit 201, configured to calculate and save the energy of each speech frame obtained by the framing module 10.
Specifically, silence detection can be performed in two ways, the short-time energy method (formula 2-7) and the short-time average magnitude method (formula 2-8), denoted $E_n$ and $M_n$ respectively:
(formula 2-7)
(formula 2-8)
Of the two, the short-time average magnitude method is faster to compute. The embodiment of the invention uses the short-time energy method; the energies obtained for the speech frames are also saved as the energy frame sequence E for the subsequent note cutting.
A comparing unit 202, configured to compare the energy of each speech frame obtained by the calculation and storage unit 201 with a preset threshold.
Specifically, the comparing unit 202 uses the preset threshold Thr to judge whether each speech frame is silent: a speech frame is non-silent when its energy is greater than Thr, and silent otherwise.
A setting unit 203, configured to set to 0 the pitch of any speech frame whose energy is less than the preset threshold.
Specifically, when the comparing unit 202 judges the current speech frame to be silent, the setting unit 203 directly sets the pitch of the current speech frame to 0, and no pitch extraction is performed on it.
Referring to Fig. 5, which is a structural diagram of an embodiment of the computing module 40 in Fig. 2, the computing module 40 comprises a sampling unit 401, a segmenting unit 402, a first contrast unit 403, a recording unit 404 and a determining unit 405, wherein:
The sampling unit 401 is configured to apply the standard autocorrelation function to the clipped signal of each speech frame obtained by the processing module 30 to obtain the autocorrelation function values of each speech frame, and to select, in each speech frame, the autocorrelation function values that correspond to the specific sampling rate;
Specifically, because the pitch period lies in the range (2, 20) ms, for a sampling rate of 8 kHz the sampling unit 401 takes the autocorrelation function values $R_n(k)$ with index k = 16 to 160 in each speech frame; for a sampling rate of 16 kHz it takes the values $R_n(k)$ with index k = 32 to 320. The selected autocorrelation function values thus cover a wide range with high precision, ensuring that all pitches can subsequently be detected without omission.
In the embodiment of the invention, the sampling unit 401 selects the autocorrelation function values of $R_n(k)$ with index k = 32 to 320, i.e. the sampling rate is 16 kHz.
The segmenting unit 402 is configured to divide the selected autocorrelation function of the current speech frame into equal segments according to a preset number of segments.
Specifically, the segmenting unit 402 divides the points of $R_n(k)$ with index k = 32 to 320 into L segments, so that each segment contains (320 - 31)/L autocorrelation function values. To balance computational accuracy against speed, L is usually preset to 4; a number of segments larger than 4 can be chosen to improve accuracy, but this inevitably slows the calculation. L = 4 is assumed in the present embodiment.
The first contrast unit 403 is configured to compare pairwise, in turn, the autocorrelation function values within each segment obtained by the segmenting unit 402 for the current speech frame, and to take the largest value in each segment as a peak, thereby obtaining the autocorrelation peak sequence of the current speech frame;
Specifically, the first contrast unit 403 finds the autocorrelation peak sequence p(i) of each speech frame as follows:
The first contrast unit 403 searches each segment obtained by the segmenting unit 402 for its largest autocorrelation function value. For the 1st segment, it compares the autocorrelation function values pairwise in turn until all (320 - 31)/4 values have been compared, obtaining the largest autocorrelation function value in the 1st segment; proceeding in the same way for the other segments yields 4 maximum autocorrelation function values from the 4 segments;
The first contrast unit 403 takes these 4 maximum autocorrelation function values as peaks, obtaining the autocorrelation peak sequence p(i) of the current speech frame.
The recording unit 404 is configured to record the position corresponding to each peak in the autocorrelation peak sequence of the current speech frame obtained by the first contrast unit 403.
Specifically, the recording unit 404 records, for each peak in p(i), its position ID(i) in $R_n(k)$, 0 ≤ i ≤ L-1, giving 4 positions in total.
The determining unit 405 is configured to traverse the autocorrelation peak sequence of the current speech frame obtained by the first contrast unit 403 to determine the pitch position, and to substitute the specific sampling rate and the pitch position into the pitch formula to obtain the pitch of the current speech frame.
Specifically, the determining unit 405 traverses p(i) and finds the pitch position pos according to formula 2-9:
(formula 2-9)
Wherein r takes a value in [0.6, 0.8]. In general the peaks in p(i) are arranged in descending order, in which case $p_{max}$ is the first peak in p(i) and pos is the ID(i) of that first peak. Special cases in which the peaks in p(i) are not arranged in descending order cannot, of course, be excluded.
To ensure that pos is accurate, p(i) is traversed from front to back according to formula 2-9: the initial value of $p_{max}$ is the first peak in p(i); if a later peak, multiplied by a factor in [0.6, 0.8], is still larger than the preceding peak, the later peak is much larger than the preceding one and becomes $p_{max}$; otherwise $p_{max}$ keeps the preceding peak. The traversal proceeds in this way until the last peak in p(i) has been visited, and the pos corresponding to $p_{max}$ is then determined.
Substituting pos and the sampling rate fs into formula 2-10 gives the pitch value pitch of the current speech frame.
(formula 2-10)
Repeating the above process yields the pitch of each speech frame; the pitch values of all speech frames are then combined into the pitch frame sequence ph.
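The segment-wise peak search and the traversal can be sketched as follows. Formulas 2-9 and 2-10 are not reproduced in this text; the sketch reads the traversal as comparing each later peak, scaled by r, against the current $p_{max}$, and assumes pitch = fs / pos, the usual relation between a lag in samples and a frequency. The names `segment_peaks` and `pick_pitch`, and the use of ceiling division when the lags do not divide evenly, are assumptions.

```python
def segment_peaks(acf, k_min, k_max, segments=4):
    """Return one (peak value, lag) pair per segment of the lag range."""
    lags = list(range(k_min, k_max + 1))
    size = -(-len(lags) // segments)          # ceiling division: segment length
    peaks = []
    for start in range(0, len(lags), size):
        chunk = lags[start:start + size]
        best = max(chunk, key=lambda k: acf[k])   # lag of the largest value
        peaks.append((acf[best], best))
    return peaks

def pick_pitch(peaks, fs, r=0.7):
    """Traverse the peaks front to back and convert the chosen lag to Hz."""
    p_max, pos = peaks[0]
    for value, lag in peaks[1:]:
        if value * r > p_max:                 # later peak is much larger
            p_max, pos = value, lag
    return fs / pos                           # assumed form of formula 2-10
```

Here `acf` is any sequence indexable by lag, for example a list holding $R_n(k)$ for k = 0 to 320 at a 16 kHz sampling rate, in which case `segment_peaks(acf, 32, 320)` returns the 4 peaks p(i) together with their positions ID(i).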
Referring to Fig. 6, which is a structural diagram of an embodiment of the cutting module 50 in Fig. 2, the cutting module 50 comprises a second contrast unit 501, a note unit 502, a note intensity unit 503 and a note duration unit 504, wherein:
The second contrast unit 501 is configured to compare in turn the energies of the speech frames obtained by the calculation and storage unit 201, and to take as a peak the energy of any current speech frame that is greater than the energies of both the two preceding and the two following speech frames, until the last speech frame has been compared, thereby obtaining the energy peak sequence of the whole voice signal.
Specifically, given the energy frame sequence E formed by the energies of the speech frames obtained by the calculation and storage unit 201 and the pitch frame sequence ph formed by the pitches of the speech frames obtained by the computing module 40, the second contrast unit 501 finds all the peaks in E as follows:
The second contrast unit 501 compares in turn the energies of the speech frames obtained by the calculation and storage unit 201 and takes as a peak the energy of any current speech frame that is greater than the energies of both the two preceding and the two following speech frames, until the last speech frame has been compared. This yields the energy peak sequence pk of the whole voice signal and, with it, the index sequence IND of the peaks.
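The energy peak search can be sketched as follows. The name `energy_peaks` is illustrative, and the choice to skip the first two and last two frames, which do not have two neighbours on both sides, is an assumption.

```python
def energy_peaks(energies):
    """Return (IND, pk): indices and values of frames whose energy exceeds
    both the two preceding and the two following frames."""
    ind = [
        i for i in range(2, len(energies) - 2)
        if all(energies[i] > energies[j] for j in (i - 2, i - 1, i + 1, i + 2))
    ]
    return ind, [energies[i] for i in ind]
```

Each index in `ind` marks the frame at which one note is taken to occur; the pitch at the same index in ph is then used to compute that note.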
The note unit 502 is configured to calculate a note from the pitch corresponding to each peak in the energy peak sequence obtained by the second contrast unit 501.
Specifically, for each peak in pk, the note unit 502 takes the index IND of that peak, reads the pitch value at that index in ph, and uses it as a new note pitch np. The note is then calculated according to formula 2-11:
(formula 2-11)
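Formula 2-11 itself is not reproduced in this text. As an illustration only, a widely used convention for mapping a pitch in Hz to a MIDI note number (A4 = 440 Hz = note 69, twelve semitones per octave) is sketched below; whether formula 2-11 takes exactly this form is an assumption, and the function name `pitch_to_midi_note` is hypothetical.

```python
import math


def pitch_to_midi_note(np_hz):
    """Map a note pitch in Hz to the nearest MIDI note number.

    Assumption: standard equal-temperament mapping with A4 = 440 Hz = 69.
    Frames whose pitch was set to 0 (silence) yield None.
    """
    if np_hz <= 0:
        return None
    return round(69 + 12 * math.log2(np_hz / 440.0))
```

For example, a note pitch of 440 Hz maps to note 69 and 880 Hz to note 81.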
The note intensity unit 503 is configured to calculate, from each peak of the energy peak sequence obtained by the second comparison unit 501, the note intensity corresponding one-to-one to each note.
Specifically, the note intensity unit 503 substitutes each peak in pk into formula 2-12 to calculate the note intensity A corresponding one-to-one to the note:
(formula 2-12)
Calculating the note intensity with formula 2-12 better reflects the relationship between the average energy and the maximum amplitude of a sinusoidal signal. In addition, saturation arithmetic is applied to A to prevent overflow.
The note duration unit 504 is configured to obtain, from the time interval between every two peaks of the energy peak sequence obtained by the second comparison unit 501, the note duration corresponding one-to-one to each note.
Specifically, the note duration unit 504 obtains the note duration Tn corresponding one-to-one to the note from the time interval between every two peaks in pk.
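A minimal sketch of the duration calculation, assuming that the time interval between two peaks is the difference of their frame indices multiplied by the frame shift in seconds. The names `note_durations`, `frame_shift` and `sample_rate`, and the numeric values, are hypothetical; how the last note (which has no following peak) is timed is not stated in the text and is left out here.

```python
def note_durations(peak_indices, frame_shift, sample_rate):
    """Duration in seconds between each pair of consecutive energy peaks.

    frame_shift is the hop between frame starts in samples (assumed).
    Returns one value fewer than the number of peaks.
    """
    seconds_per_frame = frame_shift / sample_rate
    return [(b - a) * seconds_per_frame
            for a, b in zip(peak_indices, peak_indices[1:])]


# Hypothetical values: peaks at frames 2, 6 and 14, hop 256 samples, 8 kHz
Tn = note_durations([2, 6, 14], frame_shift=256, sample_rate=8000)
```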
In addition, the segmentation module 50 can also obtain beat and key signature information.
Specifically, the segmentation module 50 can obtain beat information by analysing how frequently valleys occur in the energy frame sequence pk. This frequency is obtained by the computing module 40 by finding the period of the autocorrelation function, with the energy frame sequence taking the place of the three-level-clipped signal; the procedure is similar to the extraction of the pitch frame sequence and is not repeated here.
The key signature information can be set to a default value, such as the key of C.
By progressively processing the received voice signal through framing, frame shifting, silence detection, filtering, three-level clipping, standard autocorrelation calculation, peak selection and median filtering, the embodiment of the invention obtains the energy and pitch of each speech frame. Further segmentation and recognition of the energy and pitch then yields the information needed to produce a melody, such as notes, note intensities, note durations, beat and key signature. The process is fast, accurate and flexible; combined with the instrument type set by the user, it can produce both a MIDI file and numbered musical notation and staff notation.
In particular, three-level clipping not only eliminates the influence of formants in the voice signal on pitch extraction, but also reduces the amount of computation needed for the autocorrelation function.
Referring to Fig. 7, which is a flow chart of a first embodiment of the music transcription method based on humming input of the present invention, the method comprises:
S101, framing the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames.
Specifically, the user can hum freely, at any time and in any place, into the melody synthesizer based on humming input provided by the embodiment of the invention. S101 receives the voice signal hummed by the user and applies a window function to divide it into frames.
Specifically, S101 applies a window function of length N to the voice signal, dividing it into multiple frames for processing. The speech frames after framing can be expressed by formula 2-1:
$x_n(m) = w(m)\,x(n+m), \quad 0 \le m \le N-1$ (formula 2-1)
where $x_n(m)$ denotes the m-th sample of frame n, N is the frame length, and w(m) is the window function. In this embodiment a rectangular window is used, with the window function given by formula 2-2.
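The framing of formula 2-1 can be sketched as below. This is an illustration under stated assumptions: the rectangular window is taken to be 1 for every sample inside the frame, the optional `shift` parameter (hypothetical name) stands for the frame shift that makes frames overlap, and a trailing partial frame is simply dropped.

```python
def rectangular_window(N):
    """Rectangular window: assumed to be 1 for 0 <= m <= N-1."""
    return [1.0] * N


def split_into_frames(x, N, shift=None):
    """Apply x_n(m) = w(m) * x(n + m) for each frame start n.

    shift is the hop between frame starts; shift < N gives overlapping
    frames. Assumption: samples that do not fill a whole frame are dropped.
    """
    if shift is None:
        shift = N
    w = rectangular_window(N)
    frames = []
    for n in range(0, len(x) - N + 1, shift):
        frames.append([w[m] * x[n + m] for m in range(N)])
    return frames


# Hypothetical signal of 10 samples, frame length 4, frame shift 2
frames = split_into_frames(list(range(10)), N=4, shift=2)
```

With these values the frames start at samples 0, 2, 4 and 6, so adjacent frames share two samples.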
S102, performing silence detection on each speech frame to obtain the energy of each speech frame.
Specifically, so that the pitch can be located more accurately during the subsequent extraction of the pitch frame sequence, silence detection is required, and the pitch of silent portions is set to 0. Likewise, because pure-noise portions have very low volume and no stable pitch, their pitch is also set to 0.
In addition, the energy of each speech frame calculated by S102 during silence detection must be saved for use in the subsequent note segmentation.
S103, filtering each speech frame and applying three-level clipping to the filtered signal to obtain the clipped signal corresponding to each speech frame.
Specifically, to eliminate the influence of electromagnetic noise, environmental noise and formants on the subsequent pitch calculation, S103 applies low-pass filtering to each speech frame.
Considering that the pitch of speech generally lies in the range (50, 500) Hz, a low-pass filter with cutoff frequency wc = 500 Hz could be applied directly. Since humming differs from everyday speech, however, the cutoff frequency wc is here set to any value in the range [500, 1000] Hz.
Specifically, the low-pass filter can be either an IIR filter or an FIR filter. An IIR filter has low delay and low computational cost and can meet real-time requirements, but its stability must be considered. An FIR filter is simple to design; using a low-order FIR filter reduces the delay while still giving reasonably good phase information. This embodiment takes an FIR filter as the example; the impulse response function of the low-pass filter is given by formula 2-3:
(formula 2-3)
where N is the filter length, N-1 is the filter order, and M = (N-1)/2.
Similarly, a band-pass filter with lower and upper cutoff frequencies of 50 Hz and 500 Hz, respectively, can instead be applied to each speech frame; this is not described further here.
Next, S103 performs the convolution of formula 2-4 on each speech frame to obtain the low-pass-filtered signal.
$y_m(n) = h(n) * x_m(n)$ (formula 2-4)
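The filtering step can be sketched as follows. Formula 2-3 is not reproduced in this text, so the impulse response here is an assumption: a truncated ideal low-pass (sinc) response centred at M = (N-1)/2, which is one common FIR design. The function names `lowpass_fir` and `convolve` and the parameter values are hypothetical; only the convolution itself corresponds directly to formula 2-4.

```python
import math


def lowpass_fir(N, cutoff_hz, sample_rate):
    """Truncated ideal low-pass impulse response of length N (assumed design)."""
    M = (N - 1) / 2                       # centre of the response
    wc = 2 * math.pi * cutoff_hz / sample_rate
    h = []
    for n in range(N):
        t = n - M
        h.append(wc / math.pi if t == 0 else math.sin(wc * t) / (math.pi * t))
    return h


def convolve(h, x):
    """Direct-form linear convolution y(n) = sum_k h(k) * x(n - k)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += hk * xn
    return y


# Hypothetical parameters: 21-tap filter, 800 Hz cutoff, 8 kHz sampling
h = lowpass_fir(N=21, cutoff_hz=800, sample_rate=8000)
```

Because the response is symmetric about M, the filter has linear phase, which is consistent with the remark that an FIR filter can preserve phase information.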
To further eliminate the influence of formants in the speech on pitch extraction, and to reduce the computation required by the autocorrelation function in the next step, S103 in the embodiment of the invention then applies three-level clipping to each low-pass-filtered speech frame.
Three-level clipping is implemented by the function given in formula 2-5:
(formula 2-5)
where CL = r*max[x(i)], 0<i<N-1; the value of r is set empirically within the range [0.6, 0.8], and r = 0.68 is used in most cases; max denotes the maximum operation.
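Formula 2-5 is not reproduced in this text. The sketch below uses the conventional definition of three-level clipping (output +1 above the clipping level, -1 below its negative, 0 otherwise), which is assumed to match. It also takes the maximum over absolute sample values so that CL is never negative; that choice and the function name `three_level_clip` are assumptions.

```python
def three_level_clip(frame, r=0.68):
    """Map each sample to +1, 0 or -1 relative to the clipping level CL."""
    # Assumption: CL is based on the largest absolute sample value.
    CL = r * max(abs(v) for v in frame)
    clipped = []
    for v in frame:
        if v > CL:
            clipped.append(1)
        elif v < -CL:
            clipped.append(-1)
        else:
            clipped.append(0)
    return clipped


# Hypothetical filtered frame
clipped = three_level_clip([0.1, 0.9, -1.0, 0.5, -0.7])
```

Since the clipped signal contains only the values -1, 0 and 1, the products in the subsequent autocorrelation reduce to sign comparisons, which is where the saving in computation comes from.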
S104, performing standard autocorrelation calculation on the clipped signal corresponding to each speech frame, and extracting the pitch of each speech frame from the result.
Specifically, S104 uses the autocorrelation function of formula 2-6 to perform standard autocorrelation calculation on the clipped signal of each speech frame obtained by S103.
(formula 2-6)
where the autocorrelation function $R_n(k)$ is an even function, $x_n(m)$ is the signal of the n-th speech frame after three-level clipping, and N is the frame length.
For $R_n(k)$, the pitch position can be located by a peak-search strategy, which gives the pitch of each speech frame.
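A minimal sketch of this step, under stated assumptions: formula 2-6 is taken to be the usual short-time autocorrelation summed over the overlapping samples, the pitch is taken to be the sampling rate divided by the lag of the highest peak, and the lag search range (`min_lag`, `max_lag`) is a hypothetical parameter not given in the text.

```python
def autocorrelation(x):
    """R(k) = sum over m of x(m) * x(m + k), for k = 0 .. N-1 (assumed form)."""
    N = len(x)
    return [sum(x[m] * x[m + k] for m in range(N - k)) for k in range(N)]


def pitch_from_autocorr(R, sample_rate, min_lag, max_lag):
    """Pick the lag with the largest R in [min_lag, max_lag]; pitch = fs / lag."""
    best = max(range(min_lag, max_lag + 1), key=lambda k: R[k])
    if R[best] <= 0:
        return 0.0          # no periodicity found
    return sample_rate / best


# Hypothetical clipped frame with a period of 8 samples
x = [1, 0, 0, 0, -1, 0, 0, 0] * 4
R = autocorrelation(x)
pitch = pitch_from_autocorr(R, sample_rate=8000, min_lag=2, max_lag=20)
```

Because R is even, only non-negative lags need to be computed. Lag 0 is excluded from the search since R(0) is always the largest value.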
S105, performing note segmentation on the voice signal according to the pitch frame sequence of each speech frame to obtain a segmentation result.
Specifically, S105 performs note segmentation on the pitch frame sequence of the speech frames to obtain notes, note intensities and note durations, the note intensities and note durations corresponding one-to-one to the notes.
S105 can also obtain the beat of each speech frame.
The key signature information of each speech frame can be set to a default value, such as the key of C.
S106, obtaining and outputting a music score based on the instrument type set by the user and the segmentation result.
Specifically, the user can select any type of instrument through the external interface of the melody synthesizer based on humming input of the embodiment of the invention. According to the instrument type set by the user and the segmentation result obtained by S105, namely the notes, note intensities and note durations, S106 synthesizes the melody by frequency modulation (FM) synthesis and outputs it in MIDI format.
In addition, by combining the beat and key signature information of each speech frame, numbered musical notation and staff notation can also be obtained.
By progressively processing the received voice signal through framing, silence detection, filtering, three-level clipping, standard autocorrelation calculation and peak selection, the embodiment of the invention obtains the energy and pitch of each speech frame. Further segmentation and recognition of the energy and pitch then yields the information needed to produce a melody, such as notes, note intensities, note durations, beat and key signature. The process is fast, accurate and flexible; combined with the instrument type set by the user, it can produce both a MIDI file and numbered musical notation and staff notation.
In particular, three-level clipping not only eliminates the influence of formants in the voice signal on pitch extraction, but also reduces the amount of computation needed for the autocorrelation function.
Referring to Fig. 8, which is a flow chart of a second embodiment of the music transcription method based on humming input of the present invention, the method comprises:
S201, framing the received voice signal according to a preset window function to obtain a plurality of corresponding speech frames.
S202, applying a frame shift to each speech frame according to a preset frame shift length, so that the speech frames overlap one another.
S203, performing silence detection on each speech frame to obtain the energy of each speech frame.
S204, filtering each speech frame and applying three-level clipping to the filtered signal to obtain the clipped signal corresponding to each speech frame.
S205, performing standard autocorrelation calculation on the clipped signal corresponding to each speech frame, and extracting the pitch of each speech frame from the result.
S206, applying median filtering to the pitch frame sequence formed from the pitches, to remove outliers.
S207, performing note segmentation on the voice signal according to the energy and pitch of each speech frame to obtain a segmentation result.
S208, obtaining and outputting a music score based on the instrument type set by the user and the segmentation result.
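The median filtering of S206 can be sketched as below. The window width is not given in the text, so the default of 3 is an assumption, as is the choice to leave the first and last frames unchanged; the function name `median_filter` is hypothetical.

```python
def median_filter(ph, width=3):
    """Replace each pitch value by the median of its neighbourhood.

    Assumptions: width is odd, and values too close to either end to
    have a full window are copied through unchanged.
    """
    half = width // 2
    out = list(ph)
    for i in range(half, len(ph) - half):
        window = sorted(ph[i - half:i + half + 1])
        out[i] = window[half]
    return out


# Hypothetical pitch frame sequence with one octave-error outlier (440)
smoothed = median_filter([220, 221, 440, 222, 223])
```

An isolated spike such as an octave error is replaced by a neighbouring value, while a genuine change of note that persists over several frames is preserved.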
Referring to Fig. 9, which is a flow chart of silence detection on each speech frame in the music transcription method based on humming input of the present invention, the detection comprises:
S2031, calculating and saving the energy of each speech frame;
S2032, judging whether the energy of each speech frame is greater than a preset threshold; if so, executing S2033; otherwise, executing S2034.
S2033, filtering the current speech frame.
Specifically, S2033 corresponds to S204 shown in Fig. 8: when S2032 determines that the energy of the current speech frame is greater than the preset threshold, S2033 applies filtering and three-level clipping to the current speech frame, and S205 shown in Fig. 8 is then executed on the current speech frame to calculate its pitch.
S2034, setting the pitch of the current speech frame to 0.
By performing silence detection on each speech frame and directly setting to 0 the pitch of any speech frame whose energy is below the threshold, the embodiment of the invention reduces the workload of the subsequent filtering and three-level clipping and the computation of the subsequent pitch calculation, thereby speeding up the processing of the speech frames.
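The flow of S2031 to S2034 can be sketched as follows. The energy formula is not reproduced in this part of the text, so the sum of squared samples is assumed; the names `frame_energy` and `silence_gate` and the threshold value are hypothetical.

```python
def frame_energy(frame):
    """Short-time energy, assumed here to be the sum of squared samples."""
    return sum(v * v for v in frame)


def silence_gate(frames, threshold):
    """Return (energies, voiced): the saved energy of every frame, and a
    flag telling whether the frame goes on to filtering and pitch
    extraction (True) or has its pitch set to 0 (False)."""
    energies = []
    voiced = []
    for frame in frames:
        e = frame_energy(frame)      # S2031: calculate and save
        energies.append(e)
        voiced.append(e > threshold)  # S2032: compare with threshold
    return energies, voiced


# Hypothetical frames: one near-silent, one voiced
energies, voiced = silence_gate([[0.0, 0.01], [0.5, -0.5]], threshold=0.1)
```

Frames flagged `False` skip filtering, clipping and autocorrelation entirely, which is the source of the speed-up described above.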
Referring to Fig. 10, which is a flow chart of performing standard autocorrelation calculation on the clipped signal corresponding to each speech frame and extracting the pitch frame sequence from the result in the music transcription method based on humming input of the present invention, the procedure comprises:
S2051, calculating the clipped signal corresponding to each speech frame with the standard autocorrelation function to obtain the autocorrelation function of each speech frame, and selecting the corresponding autocorrelation function in each speech frame according to a specific sampling rate;
S2052, dividing the selected autocorrelation function of the current speech frame into equal segments according to a preset segmentation value;
S2053, within each segment of the autocorrelation function of the current speech frame, comparing the autocorrelation values pairwise in turn and taking the largest value in each segment as a peak, to obtain the autocorrelation peak sequence of the current speech frame;
S2054, recording the position corresponding to each peak in the autocorrelation peak sequence of the current speech frame;
S2055, traversing the autocorrelation peak sequence of the current speech frame to determine the pitch position, and substituting the specific sampling rate and the pitch position into the pitch calculation formula to obtain the pitch of the current speech frame;
Repeating S2051 to S2055 yields the pitch value of each speech frame; the per-frame pitch values are then combined to form the pitch frame sequence.
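Steps S2052 to S2055 can be sketched as below. Several details are assumptions, because this part of the text does not spell them out: the pitch position is taken to be the lag of the largest positive segment peak, the pitch formula is taken to be the sampling rate divided by that lag, and `min_lag` (hypothetical) excludes the trivial peak at lag 0. The function names are also hypothetical.

```python
def segment_peaks(R, segments):
    """S2052-S2054: split R into equal segments and record each segment's
    largest value together with its position (lag)."""
    size = len(R) // segments
    peaks = []
    for s in range(segments):
        start = s * size
        end = len(R) if s == segments - 1 else start + size
        pos = max(range(start, end), key=lambda k: R[k])
        peaks.append((R[pos], pos))
    return peaks


def pitch_from_peaks(peaks, sample_rate, min_lag=1):
    """S2055 (assumed rule): take the largest positive peak at lag >= min_lag."""
    candidates = [(v, pos) for v, pos in peaks if pos >= min_lag and v > 0]
    if not candidates:
        return 0.0
    _, lag = max(candidates)
    return sample_rate / lag


# Hypothetical autocorrelation values with a peak at lag 8
R = [8, 0, 0, 0, -7, 0, 0, 0, 6, 0, 0, 0, -5, 0, 0, 0]
peaks = segment_peaks(R, segments=4)
pitch = pitch_from_peaks(peaks, sample_rate=8000, min_lag=2)
```

Searching one maximum per segment keeps the number of candidates small, so the final traversal only has to examine as many peaks as there are segments.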
Referring to Fig. 11, which is a flow chart of the note segmentation of the voice signal according to the pitch frame sequence of each speech frame in the music transcription method based on humming input of the present invention, the segmentation comprises:
S2071, comparing the energies of the speech frames in turn, and taking as a peak the energy of any current speech frame whose energy is greater than that of both the two preceding and the two following speech frames, until the last speech frame has been compared, to obtain the energy peak sequence of the whole voice signal;
S2072, calculating a note according to the pitch corresponding to each peak in the energy peak sequence;
S2073, calculating, according to each peak in the energy peak sequence, the note intensity corresponding one-to-one to the note;
S2074, obtaining, according to the time interval between every two peaks in the energy peak sequence, the note duration corresponding one-to-one to the note.
By progressively processing the received voice signal through framing, frame shifting, silence detection, filtering, three-level clipping, standard autocorrelation calculation, peak selection and median filtering, the embodiment of the invention obtains the energy and pitch of each speech frame. Further segmentation and recognition of the energy and pitch then yields the information needed to produce a melody, such as notes, note intensities, note durations, beat and key signature. Finally, a music score is obtained and output according to the instrument type set by the user. The process is fast, accurate and flexible, and can produce both a MIDI file and numbered musical notation and staff notation.
In particular, three-level clipping not only eliminates the influence of formants in the voice signal on pitch extraction, but also reduces the amount of computation needed for the autocorrelation function.
From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary hardware platform, or entirely in hardware. Based on this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in certain parts thereof.
What is disclosed above is merely a preferred embodiment of the present invention and cannot be used to limit the scope of rights of the present invention; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope covered by the present invention.