CN105359211A - Unvoiced/voiced decision for speech processing


Info

Publication number
CN105359211A
Authority
CN
China
Prior art keywords
parameter
sound
voiced
unvoiced
Legal status
Granted
Application number
CN201480038204.2A
Other languages
Chinese (zh)
Other versions
CN105359211B (en)
Inventor
高扬 (Yang Gao)
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201910358523.6A (published as CN110097896B)
Publication of CN105359211A
Application granted
Publication of CN105359211B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/90: Pitch determination of speech signals
    • G10L25/93: Discriminating between voiced and unvoiced parts of speech signals
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: ... using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/18: Vocoders using multiple modes
    • G10L19/22: Mode decision, i.e. based on audio signal content versus external parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Time-Division Multiplex Systems (AREA)
  • Telephone Function (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

In accordance with an embodiment of the present invention, a method for speech processing includes determining an unvoicing/voicing parameter reflecting a characteristic of unvoiced/voiced speech in a current frame of a speech signal comprising a plurality of frames. A smoothed unvoicing/voicing parameter is determined so as to include information of the unvoicing/voicing parameter in a frame prior to the current frame of the speech signal. A difference between the unvoicing/voicing parameter and the smoothed unvoicing/voicing parameter is computed. The method further includes generating an unvoiced/voiced decision point for determining whether the current frame comprises unvoiced or voiced speech, using the computed difference as a decision parameter.

Description

Unvoiced/Voiced Decision for Speech Processing
This application claims priority to U.S. Patent Application No. 14/476,547, filed on September 3, 2014 and entitled "Unvoiced/Voiced Decision for Speech Processing", which claims the benefit of U.S. Provisional Application No. 61/875,198, filed on September 9, 2013 and entitled "Improved Unvoiced/Voiced Decision for Speech Coding/Bandwidth Extension/Speech Enhancement", both of which are incorporated herein by reference as if reproduced in their entireties.
Technical field
The present invention relates generally to the field of speech processing, and in particular to unvoiced/voiced decision methods for speech processing.
Background
Speech coding refers to a process that reduces the bit rate of a speech file. Speech coding is an application of data compression of digital audio signals containing speech. Speech coding uses speech-specific parameter estimation based on audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream. The objective of speech coding is to achieve savings in the required memory storage space, transmission bandwidth and transmission power by reducing the number of bits per sample, such that the decoded (decompressed) speech is perceptually hard to distinguish from the original speech.
However, speech coders are lossy coders; that is, the decoded signal is different from the original. Therefore, one of the goals of speech coding is to minimize the distortion (or perceptible loss) at a given bit rate, or to minimize the bit rate for a given distortion.
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and more statistical information is available about its properties. As a result, some auditory information that is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is the preservation of intelligibility and "pleasantness" of speech with a limited amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, speaker identity, emotions, intonation and timbre, all of which are important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a property distinct from intelligibility, since it is possible for degraded speech to be completely intelligible yet subjectively annoying to the listener.
The redundancy of speech waveforms relates to different types of speech signal, such as voiced and unvoiced speech signals. Voiced sounds, e.g., 'a' or 'b', are essentially produced by vibrations of the vocal cords and are oscillatory. Therefore, over short periods of time, they are well modeled by sums of periodic signals such as sinusoids. In other words, voiced speech signals are essentially periodic. However, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from one segment to another. Low bit rate speech coding could benefit greatly from exploiting this periodicity. The voiced speech period is also called the pitch, and pitch prediction is often called long-term prediction (LTP). By contrast, unvoiced sounds such as 's' and 'sh' are more noise-like. This is because an unvoiced speech signal is more like a random noise and has less predictability.
Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information to be transmitted and to estimate the parameters of the speech samples of a signal over short intervals. This redundancy arises primarily from the repetition of the speech waveform at a quasi-periodic rate and from the slowly changing spectral envelope of the speech signal.
The redundancy of speech waveforms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. Although voiced speech signals are essentially periodic, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Low bit rate speech coding could benefit greatly from exploiting this periodicity. The voiced speech period is also called the pitch, and pitch prediction is often called long-term prediction (LTP). As for unvoiced speech, the signal is more like a random noise and has less predictability.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech signal from the spectral envelope component. The slowly changing spectral envelope can be represented by linear predictive coding (LPC), also called short-term prediction (STP). Low bit rate speech coding could also benefit greatly from exploiting such short-term prediction. The coding advantage arises from the slow rate at which the parameters change, and it is rare for the parameter values to differ significantly from those held within a few milliseconds. Accordingly, at sampling rates of 8 kHz, 12.8 kHz or 16 kHz, speech coding algorithms adopt nominal frame durations in the range of ten to thirty milliseconds. A frame duration of twenty milliseconds is the most common choice.
Code-excited linear prediction ("CELP") has been adopted in recent well-known standards such as G.723.1, G.729, G.718, Enhanced Full Rate (EFR), Selectable Mode Vocoder (SMV), Adaptive Multi-Rate (AMR), Variable-Rate Multimode Wideband (VMR-WB), and Adaptive Multi-Rate Wideband (AMR-WB). CELP is commonly understood as a technical combination of coded excitation, long-term prediction and short-term prediction. CELP mainly encodes speech signals by exploiting human vocal characteristics or a human voice production model. CELP speech coding is a very popular algorithmic principle in the speech compression field, although the details of CELP in different codecs can differ significantly. Owing to its popularity, the CELP algorithm has been used in various standards of ITU-T, MPEG, 3GPP and 3GPP2. Variants of CELP include algebraic CELP, relaxed CELP, low-delay CELP, vector sum excited linear prediction, and others. CELP is a generic term for a class of algorithms rather than a particular codec.
The CELP algorithm is based on four main ideas. First, a source-filter model of speech production through linear prediction (LP) is used. The source-filter model of speech production models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). In implementations of the source-filter model of speech production, the sound source, or excitation signal, is often modeled as a periodic impulse train for voiced speech, or white noise for unvoiced speech. Second, an adaptive codebook and a fixed codebook are used as the input (excitation) of the LP model. Third, the search is performed in closed loop in a "perceptually weighted domain". Fourth, vector quantization (VQ) is applied.
Summary of the invention
According to an embodiment of the present invention, a method of speech processing comprises determining an unvoicing/voicing parameter reflecting a characteristic of unvoiced/voiced speech in a current frame of a speech signal comprising a plurality of frames. A smoothed unvoicing/voicing parameter is determined, comprising information of the unvoicing/voicing parameter in a frame prior to the current frame of the speech signal. A difference between the unvoicing/voicing parameter and the smoothed unvoicing/voicing parameter is computed. The method further comprises generating an unvoiced/voiced decision point for determining whether the current frame comprises unvoiced or voiced speech, using the computed difference as a decision parameter.
In an alternative embodiment, a speech processing apparatus comprises a processor and a computer-readable storage medium storing a program for execution by the processor. The program includes instructions to determine an unvoicing/voicing parameter reflecting a characteristic of unvoiced/voiced speech in a current frame of a speech signal comprising a plurality of frames, and to determine a smoothed unvoicing/voicing parameter comprising information of the unvoicing/voicing parameter in a frame prior to the current frame of the speech signal. The program further includes instructions to compute a difference between the unvoicing/voicing parameter and the smoothed unvoicing/voicing parameter, and to generate an unvoiced/voiced decision point for determining whether the current frame comprises unvoiced or voiced speech, using the computed difference as a decision parameter.
In an alternative embodiment, a method of speech processing comprises providing a plurality of frames of a speech signal and, for a current frame, determining a first parameter from a first energy envelope of the speech signal in the time domain in a first frequency band and a second parameter from a second energy envelope of the speech signal in the time domain in a second frequency band. A smoothed first parameter and a smoothed second parameter are determined from a previous frame of the speech signal. The first parameter is compared with the smoothed first parameter, and the second parameter is compared with the smoothed second parameter. The results of the comparisons are used as decision parameters to generate an unvoiced/voiced decision point for determining whether the current frame comprises unvoiced or voiced speech.
Accompanying drawing explanation
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Fig. 1 illustrates a time-domain energy evaluation of a low-band speech signal in accordance with an embodiment of the present invention;
Fig. 2 illustrates a time-domain energy evaluation of a high-band speech signal in accordance with an embodiment of the present invention;
Fig. 3 illustrates operations performed during encoding of an original speech signal using a conventional CELP encoder implementing an embodiment of the present invention;
Fig. 4 illustrates operations performed during decoding of the original speech using a conventional CELP decoder implementing an embodiment of the present invention;
Fig. 5 illustrates a conventional CELP encoder used in implementing embodiments of the present invention;
Fig. 6 illustrates a basic CELP decoder corresponding to the encoder in Fig. 5 in accordance with an embodiment of the present invention;
Fig. 7 illustrates noise-like candidate vectors for constructing the coded excitation codebook or fixed codebook of CELP speech coding;
Fig. 8 illustrates pulse-like candidate vectors for constructing the coded excitation codebook or fixed codebook of CELP speech coding;
Fig. 9 illustrates an example of an excitation spectrum for voiced speech;
Fig. 10 illustrates an example of an excitation spectrum for unvoiced speech;
Fig. 11 illustrates an example of an excitation spectrum for a background noise signal;
Figs. 12A and 12B illustrate examples of frequency-domain encoding/decoding with bandwidth extension, where Fig. 12A shows the encoder with BWE side information and Fig. 12B shows the decoder with BWE;
Figs. 13A to 13C describe speech processing operations according to the various embodiments described above;
Fig. 14 illustrates a communication system 10 according to an embodiment of the present invention; and
Fig. 15 illustrates a block diagram of a processing system that may be used for implementing the devices and methods disclosed herein.
Detailed Description
In modern audio/speech digital signal communication systems, a digital signal is compressed at an encoder, and the compressed information or bitstream can be packetized and sent frame by frame to a decoder through a communication channel. The decoder receives and decodes the compressed information to obtain the audio/speech digital signal.
In order to encode speech signals more efficiently, speech signals may be classified into different classes, and each class encoded in a different way. For example, in some standards such as G.718, VMR-WB or AMR-WB, speech signals are classified into UNVOICED, TRANSITION, GENERIC, VOICED and NOISE.
A voiced speech signal is a quasi-periodic type of signal, which usually has more energy in the low-frequency region than in the high-frequency region. In contrast, an unvoiced speech signal is a noise-like signal, which has more energy in the high-frequency region than in the low-frequency region. Unvoiced/voiced classification or the unvoiced decision is widely used in the fields of speech signal coding, speech signal bandwidth extension, speech signal enhancement and speech signal background noise reduction (NR).
In speech coding, unvoiced speech signals and voiced speech signals can be encoded/decoded in different ways. In speech signal bandwidth extension, the energy of the extended high-band signal may be controlled differently for unvoiced speech signals and voiced speech signals. In speech signal background noise reduction, the NR algorithms for unvoiced speech signals and voiced speech signals may be different. So a robust unvoiced decision is important for the various applications mentioned above.
Embodiments of the present invention improve the accuracy of classifying an audio signal as a voiced signal or an unvoiced signal prior to operations such as speech coding, bandwidth extension and/or speech enhancement. Thus, embodiments of the present invention may be applied to speech signal coding, speech signal bandwidth extension, speech signal enhancement and speech signal background noise reduction. In particular, embodiments of the present invention may be used in improving the ITU-T AMR-WB speech coder standard in the area of bandwidth extension.
Figures 1 and 2 illustrate speech signal characteristics, according to embodiments of the present invention, that are used to improve the accuracy of classifying an audio signal as a voiced signal or an unvoiced signal. The speech signal is evaluated in two regions: a low band and a high band, as explained below.
Fig. 1 illustrates a time-domain energy evaluation of a low-band speech signal in accordance with an embodiment of the present invention.
The time-domain energy envelope 1101 of the low-band speech signal is a smoothed energy envelope over time and comprises a first background noise region 1102 and a second background noise region 1105, separated from each other by an unvoiced speech region 1103 and a voiced speech region 1104. The energy of the low-frequency voiced speech signal in the voiced speech region 1104 is higher than the energy of the low-frequency unvoiced speech signal in the unvoiced speech region 1103. Moreover, the energy of the low-frequency unvoiced speech signal is higher than, or close to, the energy of the low-frequency background noise signal.
Fig. 2 illustrates a time-domain energy evaluation of a high-band speech signal in accordance with an embodiment of the present invention.
Compared to Fig. 1, a high-band speech signal has different characteristics. The time-domain energy envelope 1201 of the high-band speech signal, which is a smoothed energy envelope over time, comprises a first background noise region 1202 and a second background noise region 1205, separated from each other by an unvoiced speech region 1203 and a voiced speech region 1204. The energy of the high-frequency voiced speech signal is lower than that of the high-frequency unvoiced speech signal. The energy of the high-frequency unvoiced speech signal is much higher than that of the high-frequency background noise signal. However, the duration of the high-frequency unvoiced speech region 1203 is relatively shorter than that of the voiced speech region 1204.
Embodiments of the present invention exploit this difference between the time-domain characteristics of voiced and unvoiced speech in different frequency bands. For example, the signal in a current frame may be determined to be a voiced signal by determining that its energy is higher than the energy of a corresponding unvoiced signal in the low band but not in the high band. Similarly, the signal in a current frame may be determined to be an unvoiced signal by determining that its energy is lower than the energy of a corresponding voiced signal in the low band but higher than the energy of a corresponding voiced signal in the high band.
Traditionally, two major parameters are used to detect unvoiced/voiced speech signals. One parameter represents the signal periodicity, and the other indicates the spectral tilt, which is the degree to which intensity falls off as frequency increases.
A commonly used signal periodicity parameter is given in formula (1) below.
$$P_{voicing1}=\frac{\sum_{n}s_w(n)\cdot s_w(n-Pitch)}{\sqrt{\left(\sum_{n}|s_w(n)|^{2}\right)\left(\sum_{n}|s_w(n-Pitch)|^{2}\right)}}=\frac{\langle s_w(n),\,s_w(n-Pitch)\rangle}{\|s_w(n)\|_{2}\,\|s_w(n-Pitch)\|_{2}}\qquad(1)$$
In formula (1), s_w(n) is a weighted speech signal, the numerator is a correlation coefficient, and the denominator is an energy normalization factor. The periodicity parameter is also called the "pitch correlation" or "voicing". An example of another voicing parameter is given in formula (2) below.
$$P_{voicing2}=\frac{\sum_{n}|G_{p}\cdot e_{p}(n)|^{2}-\sum_{n}|G_{c}\cdot e_{c}(n)|^{2}}{\sum_{n}|G_{p}\cdot e_{p}(n)|^{2}+\sum_{n}|G_{c}\cdot e_{c}(n)|^{2}}=\frac{\|G_{p}\cdot e_{p}(n)\|^{2}-\|G_{c}\cdot e_{c}(n)\|^{2}}{\|G_{p}\cdot e_{p}(n)\|^{2}+\|G_{c}\cdot e_{c}(n)\|^{2}}\qquad(2)$$
In formula (2), e_p(n) and e_c(n) are excitation component signals, which are described further below. Variants of formulas (1) and (2) may be used in various applications, but they still represent the signal periodicity.
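As an illustration of formula (1), the following sketch computes the normalized pitch correlation over one frame; the function name and the use of plain Python lists are assumptions made for illustration, not part of the embodiments.

```python
import math

def pitch_correlation(sw, pitch):
    """Normalized correlation between sw(n) and sw(n - pitch) over one frame."""
    num = sum(sw[n] * sw[n - pitch] for n in range(pitch, len(sw)))
    e1 = sum(sw[n] ** 2 for n in range(pitch, len(sw)))
    e2 = sum(sw[n - pitch] ** 2 for n in range(pitch, len(sw)))
    if e1 == 0.0 or e2 == 0.0:
        return 0.0
    # Correlation in the numerator, energy normalization in the denominator.
    return num / math.sqrt(e1 * e2)
```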
The most commonly used spectral tilt parameter is given in formula (3) below.
$$P_{tilt1}=\frac{\sum_{n}s(n)\cdot s(n-1)}{\sum_{n}|s(n)|^{2}}=\frac{\langle s(n),\,s(n-1)\rangle}{\|s(n)\|_{2}^{2}}\qquad(3)$$
In formula (3), s(n) is the speech signal. If the frequency-domain energy is available, the spectral tilt parameter can be described as in formula (4).
$$P_{tilt2}=\frac{E_{LB}-E_{HB}}{E_{LB}+E_{HB}}\qquad(4)$$
In formula (4), E_LB is the low-band energy and E_HB is the high-band energy.
Another parameter that can reflect the spectral tilt is called the zero-crossing rate (ZCR). The ZCR counts the rate of positive/negative sign changes of the signal over a frame or subframe. Usually, when the high-band energy is high relative to the low-band energy, the ZCR is also high. Otherwise, when the high-band energy is low relative to the low-band energy, the ZCR is also low. In real applications, variants of formulas (3) and (4) may be used, but they still represent the spectral tilt.
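The following sketch illustrates the tilt measures of formulas (3) and (4) and the zero-crossing rate just described; the function names are illustrative assumptions.

```python
def tilt_time_domain(s):
    """Formula (3): first-order normalized autocorrelation of the speech s(n)."""
    num = sum(s[n] * s[n - 1] for n in range(1, len(s)))
    den = sum(x * x for x in s)
    return num / den if den > 0.0 else 0.0

def tilt_band_energies(e_lb, e_hb):
    """Formula (4): tilt from the low-band energy E_LB and high-band energy E_HB."""
    return (e_lb - e_hb) / (e_lb + e_hb) if (e_lb + e_hb) > 0.0 else 0.0

def zero_crossing_rate(s):
    """Rate of positive/negative sign changes over the frame or subframe."""
    crossings = sum(1 for n in range(1, len(s)) if s[n - 1] * s[n] < 0.0)
    return crossings / (len(s) - 1) if len(s) > 1 else 0.0
```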
As discussed previously, unvoiced/voiced classification or the unvoiced decision is widely used in the fields of speech signal coding, speech signal bandwidth extension (BWE), speech signal enhancement and speech signal background noise reduction (NR).
In speech coding, as will be illustrated later, unvoiced speech signals may be encoded using a noise-like excitation, while voiced speech signals may be encoded using a pulse-like excitation. In speech signal bandwidth extension, the energy of the extended high-band signal may be increased for unvoiced speech signals and reduced for voiced speech signals. In speech signal background noise reduction (NR), the NR algorithm can be less aggressive for unvoiced speech signals and more aggressive for voiced speech signals. So a robust unvoiced or voiced decision is important for the various applications mentioned above. Based on the characteristics of unvoiced speech and voiced speech, the periodicity parameter P_voicing and the spectral tilt parameter P_tilt, or their variants, are mostly used for detecting the unvoiced/voiced classes. However, the inventor has found that the "absolute" values of the periodicity parameter P_voicing and the spectral tilt parameter P_tilt, or their variants, can be influenced by the speech signal recording equipment, the background noise level and/or the speaker. These influences are hard to predetermine and may result in non-robust unvoiced/voiced speech detection.
Embodiments of the present invention describe an improved unvoiced/voiced speech detection, which uses the "relative" values rather than the "absolute" values of the periodicity parameter P_voicing and the spectral tilt parameter P_tilt or their variants. The "relative" values are influenced much less than the "absolute" values by the speech signal recording equipment, the background noise level and/or the speaker, resulting in more robust unvoiced/voiced speech detection.
For example, a combined unvoicing parameter may be defined as shown in formula (5) below.
$$P_{c\_unvoicing}=(1-P_{voicing})\cdot(1-P_{tilt})\cdots\qquad(5)$$
The dots at the end of formula (5) indicate that other parameters may be added. When the "absolute" value of P_c_unvoicing becomes large, the current frame is likely to be an unvoiced speech signal. A combined voicing parameter can be described as in formula (6) below.
$$P_{c\_voicing}=P_{voicing}\cdot P_{tilt}\cdots\qquad(6)$$
The dots at the end of formula (6) indicate that other parameters may be added. When the "absolute" value of P_c_voicing becomes large, the current frame is likely to be a voiced speech signal. Before defining the "relative" value of P_c_unvoicing or P_c_voicing, a strongly smoothed parameter of P_c_unvoicing or P_c_voicing is first defined. For example, the parameter for the current frame can be obtained by smoothing the parameter over preceding frames, as described by the inequality conditions of formula (7) below.
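The inequality conditions of formula (7) are not reproduced in this text. A representative form, consistent with the surrounding description and with the example coefficients cited below (0.9 and 0.99), would be the following sketch; the exact coefficients and their assignment are assumptions:

$$P_{c\_unvoicing\_sm}\Leftarrow\begin{cases}0.9\cdot P_{c\_unvoicing\_sm}+0.1\cdot P_{c\_unvoicing}, & P_{c\_unvoicing}>P_{c\_unvoicing\_sm}\\ 0.99\cdot P_{c\_unvoicing\_sm}+0.01\cdot P_{c\_unvoicing}, & \text{otherwise}\end{cases}\qquad(7)$$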
In formula (7), P_c_unvoicing_sm is the strongly smoothed value of P_c_unvoicing.
Similarly, the smoothed combined voicing parameter P_c_voicing_sm can be determined using the inequality conditions of formula (8) below.
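Formula (8) is likewise not reproduced here; a representative form mirroring formula (7), using the other example coefficients cited below (7/8 and 255/256), would be the following sketch, again as an assumption:

$$P_{c\_voicing\_sm}\Leftarrow\begin{cases}\tfrac{7}{8}\cdot P_{c\_voicing\_sm}+\tfrac{1}{8}\cdot P_{c\_voicing}, & P_{c\_voicing}>P_{c\_voicing\_sm}\\ \tfrac{255}{256}\cdot P_{c\_voicing\_sm}+\tfrac{1}{256}\cdot P_{c\_voicing}, & \text{otherwise}\end{cases}\qquad(8)$$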
Here, in formula (8), P_c_voicing_sm is the strongly smoothed value of P_c_voicing.
The statistical behavior of voiced speech differs from that of unvoiced speech. Therefore, in various embodiments, the coefficients of the above inequalities can be determined (e.g., 0.9, 0.99, 7/8, 255/256) and further refined experimentally where necessary.
The "relative" values of P_c_unvoicing and P_c_voicing can be defined as shown in formulas (9) and (10) below.
$$P_{c\_unvoicing\_diff}=P_{c\_unvoicing}-P_{c\_unvoicing\_sm}\qquad(9)$$
where P_c_unvoicing_diff is the "relative" value of P_c_unvoicing. Similarly,
$$P_{c\_voicing\_diff}=P_{c\_voicing}-P_{c\_voicing\_sm}\qquad(10)$$
where P_c_voicing_diff is the "relative" value of P_c_voicing.
The following inequality conditions give an example embodiment of unvoiced detection. In this example embodiment, setting the flag Unvoiced_flag to TRUE indicates that the speech signal is unvoiced speech, and setting Unvoiced_flag to FALSE indicates that the speech signal is not unvoiced speech.
The following inequality conditions give an alternative example embodiment of voiced detection. In this example embodiment, setting Voiced_flag to TRUE indicates that the speech signal is voiced speech, and setting Voiced_flag to FALSE indicates that the speech signal is not voiced speech.
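The inequality conditions of these two example embodiments are not reproduced in this text. The following sketch shows one plausible realization of the complete "relative value" decision, combining formulas (5), (6), (9) and (10) with hysteresis thresholds; the threshold values 0.1 and 0.05 and the smoothing coefficients are illustrative assumptions, not values taken from the embodiments.

```python
class UnvoicedVoicedDetector:
    def __init__(self):
        self.p_c_unvoicing_sm = 0.0   # strongly smoothed combined unvoicing
        self.p_c_voicing_sm = 0.0     # strongly smoothed combined voicing
        self.unvoiced_flag = False
        self.voiced_flag = False

    @staticmethod
    def _smooth(sm, p):
        # Asymmetric smoothing in the spirit of formulas (7) and (8):
        # follow increases quickly, decreases slowly.
        return 0.9 * sm + 0.1 * p if p > sm else 0.99 * sm + 0.01 * p

    def update(self, p_voicing, p_tilt):
        p_c_unvoicing = (1.0 - p_voicing) * (1.0 - p_tilt)   # formula (5)
        p_c_voicing = p_voicing * p_tilt                     # formula (6)
        self.p_c_unvoicing_sm = self._smooth(self.p_c_unvoicing_sm, p_c_unvoicing)
        self.p_c_voicing_sm = self._smooth(self.p_c_voicing_sm, p_c_voicing)
        p_c_unvoicing_diff = p_c_unvoicing - self.p_c_unvoicing_sm  # formula (9)
        p_c_voicing_diff = p_c_voicing - self.p_c_voicing_sm        # formula (10)
        # Hysteresis: frames falling between the two thresholds keep the
        # previous flag values.
        if p_c_unvoicing_diff > 0.1:
            self.unvoiced_flag = True
        elif p_c_unvoicing_diff < 0.05:
            self.unvoiced_flag = False
        if p_c_voicing_diff > 0.1:
            self.voiced_flag = True
        elif p_c_voicing_diff < 0.05:
            self.voiced_flag = False
        return self.unvoiced_flag, self.voiced_flag
```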
After the speech signal is determined to be of the VOICED class, it may then be encoded using a time-domain coding method such as CELP. Embodiments of the present invention may also be applied to re-classify a signal from UNVOICED to VOICED before encoding.
In various embodiments, the improved unvoiced/voiced detection algorithm described above can be used to improve AMR-WB BWE and NR.
Fig. 3 illustrates operations performed during encoding of an original speech signal using a conventional CELP encoder implementing an embodiment of the present invention.
Fig. 3 illustrates a conventional initial CELP encoder in which the weighted error 109 between the synthesized speech 102 and the original speech 101 is typically minimized by using an analysis-by-synthesis approach, which means that the encoding (analysis) is performed by perceptually optimizing the decoded (synthesized) signal in a closed loop.
The basic principle that all speech coders exploit is the fact that speech signals are highly correlated waveforms. As an illustration, speech can be represented using an autoregressive (AR) model, as in formula (11) below.
$$X_{n}=\sum_{i=1}^{L}a_{i}X_{n-i}+e_{n}\qquad(11)$$
In formula (11), each sample is represented as a linear combination of the previous L samples plus a white noise term. The weighting coefficients a_1, a_2, ..., a_L are called linear prediction coefficients (LPCs). For each frame, the weighting coefficients a_1, a_2, ..., a_L are chosen so that the spectrum {X_1, X_2, ..., X_N} generated using the above model best matches the spectrum of the input speech frame.
Alternatively, speech signals can also be represented by a combination of a harmonic model and a noise model. The harmonic part of the model is effectively a Fourier series representation of the periodic component of the signal. In general, for voiced signals, the harmonic-plus-noise model of speech is composed of a mixture of harmonics and noise. The proportion of harmonics to noise in voiced speech depends on a number of factors, including the speaker characteristics (e.g., to what extent a speaker's voice is normal or breathy), the speech segment character (e.g., to what extent a speech segment is periodic), and the frequency. The higher frequencies of voiced speech have a higher proportion of noise-like components.
The linear prediction model and the harmonic-plus-noise model are the two main methods for modeling and coding speech signals. The linear prediction model is particularly good at modeling the spectral envelope of speech, whereas the harmonic-plus-noise model is good at modeling the fine structure of speech. The two methods may be combined to take advantage of their respective strengths.
As indicated previously, before CELP coding, the input signal arriving at the microphone of, e.g., a handset is filtered and sampled at a rate of, for example, 8000 samples per second. Each sample is then quantized, for example, with 13 bits per sample. The sampled speech is segmented into segments or frames of 20 ms (e.g., 160 samples in this case).
The speech signal is analyzed, and its LP model, excitation signal and pitch are extracted. The LP model represents the spectral envelope of the speech. It is converted into a set of line spectral frequency (LSF) coefficients, which are an alternative representation of the linear prediction parameters, because LSF coefficients have good quantization properties. The LSF coefficients can be scalar quantized or, more efficiently, vector quantized using previously trained LSF vector codebooks.
The code excitation comprises a codebook containing code vectors whose components are all chosen independently, so that each code vector has an approximately "white" spectrum. For each subframe of input speech, each of the code vectors is filtered through the short-term linear prediction filter 103 and the long-term prediction filter 105, and the output is compared with the speech samples. At each subframe, the code vector whose output best matches the input speech (minimizes the error) is chosen to represent that subframe.
The coded excitation 108 typically comprises pulse-like or noise-like signals, which are constructed mathematically or saved in a codebook. The codebook is available to both the encoder and the receiving decoder. The coded excitation 108, which may be a stochastic or fixed codebook, may be a vector quantization dictionary that is (implicitly or explicitly) hard-coded into the codec. Such a fixed codebook may be based on algebraic code-excited linear prediction or be stored explicitly.
A code vector from the codebook is scaled by an appropriate gain to make its energy equal to the energy of the input speech. Accordingly, the output of the coded excitation 108 is scaled by a gain G_c 107 before passing through the linear filters.
The short-term linear prediction filter 103 shapes the "white" spectrum of the code vector to resemble the spectrum of the input speech. Equivalently, in the time domain, the short-term linear prediction filter 103 incorporates short-term correlations (correlation with previous samples) into the white sequence. The filter that shapes the excitation is an all-pole model of the form 1/A(z) (short-term linear prediction filter 103), where A(z) is called the prediction filter and may be obtained by linear prediction (e.g., the Levinson-Durbin algorithm). In one or more embodiments, an all-pole filter may be used because it represents the human vocal tract well and is easy to compute.
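As an illustration of the linear prediction step just mentioned, the following is a minimal, generic sketch of the Levinson-Durbin recursion, which derives the coefficients of A(z) from a frame's autocorrelation sequence; the function name and input convention are assumptions, not details of the embodiments.

```python
def levinson_durbin(r, order):
    """Solve for the coefficients of A(z) = 1 + sum a_i z^-i from
    the autocorrelation sequence r[0..order]."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]                      # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        # Symmetric in-place update of the coefficient vector.
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)        # shrink the residual energy
    return a, err                   # a[1..order] are the LPCs
```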
The short-term linear prediction filter 103 is obtained by analyzing the original signal 101 and is represented by a set of coefficients:
$$A(z)=1+\sum_{i=1}^{P}a_{i}\cdot z^{-i},\qquad i=1,2,\ldots,P\qquad(12)$$
As discussed previously, regions of voiced speech exhibit long-term periodicity. This period, known as the pitch, is introduced into the synthesized spectrum by the pitch filter 1/(B(z)). The output of the long-term prediction filter 105 depends on the pitch and the pitch gain. In one or more embodiments, the pitch can be estimated from the original signal, the residual signal or the weighted original signal. In one embodiment, the long-term prediction function (B(z)) may be expressed as in formula (13) below.
$$B(z)=1-G_{p}\cdot z^{-Pitch}\qquad(13)$$
The weighting filter 110 is related to the above short-term prediction filter. A typical weighting filter may be represented as described in formula (14).
$$W(z)=\frac{A(z/\alpha)}{1-\beta\cdot z^{-1}}\qquad(14)$$
where β < α, 0 < β < 1, 0 < α ≤ 1.
In another embodiment, the weighting filter W(z) may be derived from the LPC filter by using bandwidth expansion, as illustrated in one embodiment in formula (15) below.
$$W(z)=\frac{A(z/\gamma_{1})}{A(z/\gamma_{2})}\qquad(15)$$
In formula (15), γ_1 > γ_2; they are the factors with which the poles are moved towards the origin.
Accordingly, for every frame of speech, the LPCs and the pitch are computed and the filters are updated. For every subframe of speech, the code vector that produces the "best" filtered output is chosen to represent the subframe. The corresponding quantized value of the gain has to be transmitted to the decoder for proper decoding. The LPCs and the pitch values also have to be quantized and sent every frame in order to reconstruct the filters at the decoder. Accordingly, the coded excitation index, the quantized gain index, the quantized long-term prediction parameter index and the quantized short-term prediction parameter index are transmitted to the decoder.
Fig. 4 illustrates operations performed during decoding of the original speech using a CELP decoder in accordance with an embodiment of the present invention.
The speech signal is reconstructed at the decoder by passing the received code vectors through the corresponding filters. Consequently, every block except post-processing has the same definition as described in the encoder of Fig. 3.
The coded CELP bitstream is received and unpacked 80 at a receiving device. For each subframe received, the received coded excitation index, quantized gain index, quantized long-term prediction parameter index and quantized short-term prediction parameter index are used to find the corresponding parameters through the corresponding decoders, e.g., the gain decoder 81, the long-term prediction decoder 82 and the short-term prediction decoder 83. For example, the positions and amplitude signs of the excitation pulses and the algebraic code vector of the coded excitation 402 may be determined from the received coded excitation index.
Referring to Fig. 4, the decoder is a combination of several blocks and comprises coded excitation 201, long-term prediction 203 and short-term prediction 205. The initial decoder further comprises a post-processing block 207 after the synthesized speech 206. The post-processing may comprise short-term post-processing and long-term post-processing.
Fig. 5 illustrates a conventional CELP encoder used in implementing embodiments of the present invention.
Fig. 5 illustrates a basic CELP encoder using an additional adaptive codebook for improving long-term linear prediction. The excitation is produced by summing the contributions from the adaptive codebook 307 and the coded excitation 308, which may be a stochastic or fixed codebook as described previously. The entries in the adaptive codebook comprise delayed versions of the excitation. This makes it possible to encode periodic signals, such as voiced sounds, efficiently.
Referring to Fig. 5, the adaptive codebook 307 comprises the past synthesized excitation 304 or a repetition of the past excitation pitch cycle at the pitch period. When the pitch delay is large or long, it can be encoded as an integer value. When the pitch delay is small or short, it is usually encoded as a more precise fractional value. The periodic information of the pitch is used to generate the adaptive component of the excitation. This excitation component is then scaled by a gain G_p 305 (also called the pitch gain).
Long-term prediction is very important for voiced speech coding, because voiced speech has a strong periodicity. The adjacent pitch cycles of voiced speech are similar to each other, which means, mathematically, that the pitch gain G_p in the following excitation expression is high, or close to 1. The resulting excitation can be expressed in formula (16) as a combination of the individual excitations.
$$e(n)=G_{p}\cdot e_{p}(n)+G_{c}\cdot e_{c}(n)\qquad(16)$$
where e_p(n) is one subframe of the sample series indexed by n, coming from the adaptive codebook 307, which comprises the past excitation 304 through the feedback loop (Fig. 5). e_p(n) may be adaptively low-pass filtered, as the low-frequency region is often more periodic or more harmonic than the high-frequency region. e_c(n) comes from the coded excitation codebook 308 (also called the fixed codebook) and is the current excitation contribution. Furthermore, e_c(n) may also be enhanced, for example, by using high-pass filtering enhancement, pitch enhancement, dispersion enhancement, formant enhancement, and others.
For voiced speech, the contribution of e_p(n) from the adaptive codebook 307 may be dominant, and the pitch gain G_p 305 is around a value of 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds, and a typical subframe size is 5 milliseconds.
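The following sketch illustrates the excitation combination of formula (16), the excitation-based voicing measure of formula (2), and synthesis through the all-pole filter 1/A(z) of formula (12); the variable and function names are illustrative assumptions.

```python
def combine_excitation(gp, ep, gc, ec):
    """Formula (16): e(n) = Gp*ep(n) + Gc*ec(n) for one subframe."""
    return [gp * p + gc * c for p, c in zip(ep, ec)]

def voicing_from_excitation(gp, ep, gc, ec):
    """Formula (2): energy contrast of the adaptive vs. fixed contributions."""
    e_adaptive = sum((gp * x) ** 2 for x in ep)
    e_fixed = sum((gc * x) ** 2 for x in ec)
    total = e_adaptive + e_fixed
    return (e_adaptive - e_fixed) / total if total > 0.0 else 0.0

def synthesize(excitation, a):
    """Pass e(n) through 1/A(z), with A(z) = 1 + sum a_i z^-i as in (12)."""
    out = []
    for n, e in enumerate(excitation):
        s = e - sum(a[i] * out[n - i] for i in range(1, len(a)) if n - i >= 0)
        out.append(s)
    return out
```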
As described in Fig. 3, the fixed coded excitation 308 is scaled by a gain G_c 306 before passing through the linear filters. The two scaled excitation components from the coded excitation 308 and the adaptive codebook 307 are added together before filtering through the short-term linear prediction filter 303. The two gains (G_p and G_c) are quantized and transmitted to the decoder. Accordingly, the coded excitation index, the adaptive codebook index, the quantized gain indices and the quantized short-term prediction parameter index are transmitted to the receiving audio device.
A CELP bitstream encoded using the device shown in Fig. 5 is received at a receiving device. Fig. 6 illustrates the corresponding decoder of the receiving device.
Fig. 6 illustrates a basic CELP decoder corresponding to the encoder in Fig. 5 in accordance with an embodiment of the present invention. Fig. 6 includes a post-processing block 408 that receives the synthesized speech 407 from the main decoder. This decoder is similar to the one in Fig. 4, except for the adaptive codebook 401.
For each subframe received, the received coded excitation index, quantized coded excitation gain index, quantized pitch index, quantized adaptive codebook gain index and quantized short-term prediction parameter index are used to find the corresponding parameters through the corresponding decoders, e.g., the gain decoder 81, the pitch decoder 84, the adaptive codebook gain decoder 85 and the short-term prediction decoder 83.
In various embodiments, the CELP decoder is a combination of several blocks and comprises coded excitation 402, adaptive codebook 401, short-term prediction 406 and post-processor 408. Every block except post-processing has the same definition as described in the encoder of Fig. 5. The post-processing may comprise short-term post-processing and long-term post-processing.
As mentioned previously, CELP is mainly used to encode speech signals by benefiting from specific human voice characteristics or a human voice production model. In order to encode speech signals more efficiently, speech signals may be classified into different classes, with each class encoded in a different way. The voiced/unvoiced classification or unvoiced decision may be one of the important and fundamental classifications among all the classifications into different classes. For each class, an LPC or STP filter is often used to represent the spectral envelope, but the excitation of the LPC filter may be different. Unvoiced signals may be encoded with a noise-like excitation. On the other hand, voiced signals may be encoded with a pulse-like excitation.
The coded excitation block (labeled 308 in Fig. 5 and 402 in Fig. 6) shows the location of the fixed codebook (FCB) for general CELP coding. A code vector selected from the FCB is scaled by a gain often denoted as G_c 306.
Fig. 7 illustrates noise-like candidate vectors for constructing the coded excitation codebook or fixed codebook of CELP speech coding.
An FCB containing noise-like vectors may be the best structure for unvoiced signals from a perceptual quality point of view. This is because the adaptive codebook contribution or LTP contribution would be small or non-existent, so the main excitation contribution relies on the FCB component for unvoiced class signals. In this case, if a pulse-like FCB were used, the output synthesized speech signal could sound spiky, because of the many zeros found in the code vectors selected from a pulse-like FCB designed for low bit rate coding.
Referring to Fig. 7, an FCB structure comprising noise-like candidate vectors is used for constructing the coded excitation. A particular noise-like code vector 502 is selected from the noise-like FCB 501 and scaled by a gain 503.
Fig. 8 illustrates pulse-like candidate vectors for constructing the coded excitation codebook or fixed codebook of CELP speech coding.
From a perceptual point of view, a pulse-like FCB provides better quality than a noise-like FCB for voiced class signals. This is because the adaptive codebook contribution or LTP contribution is dominant for highly periodic voiced class speech, so that the main excitation contribution does not rely on the FCB component for voiced class signals. If a noise-like FCB were used, the output synthesized speech signal might sound noisy or less periodic, since it is more difficult to obtain good waveform matching using the code vectors selected from a noise-like FCB designed for low bit rate coding.
Referring to Fig. 8, an FCB structure may comprise a plurality of pulse-like candidate vectors for constructing the coded excitation. A pulse-like code vector 602 is selected from the pulse-like FCB 601 and scaled by a gain 603.
Fig. 9 illustrates an example of an excitation spectrum for voiced speech. After removing the LPC spectral envelope 704, the excitation spectrum 702 is almost flat. The low-band excitation spectrum 701 is usually more harmonic than the high-band spectrum 703. Theoretically, the energy level of the idealized or unquantized high-band excitation spectrum could be almost the same as that of the low-band excitation spectrum. In practice, if both the low band and the high band are encoded with CELP technology, the energy level of the synthesized or quantized high-band spectrum may be lower than that of the synthesized or quantized low-band spectrum, for two reasons. First, closed-loop CELP coding emphasizes the low band more than the high band. Second, waveform matching for the low-band signal is easier than for the high-band signal, not only because the high-band signal changes faster, but also because it has a more noise-like character.
In low bit rate CELP coding, such as AMR-WB, the high band is usually not encoded but generated at the decoder using bandwidth extension (BWE) technology. In this case, the high-band excitation spectrum can simply be copied from the low-band excitation spectrum, with some random noise added. The high-band spectral energy envelope can be predicted or estimated from the low-band spectral energy envelope. Properly controlling the high-band signal energy becomes important when BWE is used. Unlike for an unvoiced speech signal, the energy of the generated high-band voiced speech signal has to be reduced appropriately to achieve the best perceptual quality.
Fig. 10 illustrates an example of an excitation spectrum for unvoiced speech.
In the case of unvoiced speech, the excitation spectrum 802 is almost flat after removing the LPC spectral envelope 804. Both the low-band excitation spectrum 801 and the high-band spectrum 803 are noise-like. Theoretically, the energy level of the idealized or unquantized high-band excitation spectrum could be almost the same as that of the low-band excitation spectrum. In practice, if both the low band and the high band are encoded with CELP technology, the energy level of the synthesized or quantized high-band spectrum may be the same as, or slightly higher than, that of the synthesized or quantized low-band spectrum, for two reasons. First, closed-loop CELP coding emphasizes the higher-energy region more. Second, although waveform matching for the low-band signal is easier than for the high-band signal, it is difficult to obtain good waveform matching for noise-like signals.
Similar to voiced speech coding, in low bit rate CELP coding of unvoiced speech, such as AMR-WB, the high band is usually not encoded but generated at the decoder using BWE technology. In this case, the unvoiced high-band excitation spectrum can simply be copied from the unvoiced low-band excitation spectrum, with some random noise added. The high-band spectral energy envelope of the unvoiced speech signal can be predicted or estimated from the low-band spectral energy envelope. Properly controlling the energy of the unvoiced high-band signal is especially important when BWE is used. Unlike for a voiced speech signal, the energy of the generated high-band unvoiced speech signal is preferably increased appropriately to achieve the best perceptual quality.
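The following sketch illustrates the BWE high-band excitation generation described above: the low-band excitation is copied, some random noise is added, and the result is scaled by a class-dependent energy factor. The mixing weight and the gain values are illustrative assumptions, not values from the embodiments.

```python
import random

def generate_highband_excitation(lb_excitation, signal_class):
    # Copy the low-band excitation and mix in random noise.
    mixed = [0.8 * x + 0.2 * random.gauss(0.0, 1.0) for x in lb_excitation]
    if signal_class == "VOICED":
        gain = 0.5    # reduce high-band energy for voiced speech
    elif signal_class == "UNVOICED":
        gain = 1.5    # increase high-band energy for unvoiced speech
    else:
        gain = 1.0    # background noise: keep energy stable over time
    return [gain * x for x in mixed]
```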
Fig. 11 illustrates an example of an excitation spectrum for a background noise signal.
The excitation spectrum 902 is almost flat after removing the LPC spectral envelope 904. The low-band excitation spectrum 901 is usually noise-like, as is the high-band spectrum 903. Theoretically, the idealized or unquantized high-band excitation spectrum of a background noise signal could have an energy level almost the same as that of the low-band excitation spectrum. In practice, if both the low band and the high band are encoded with CELP technology, the energy level of the synthesized or quantized high-band spectrum of a background noise signal may be lower than that of the synthesized or quantized low-band spectrum, for two reasons. First, closed-loop CELP coding emphasizes the low band, which has higher energy than the high band. Second, waveform matching for the low-band signal is easier than for the high-band signal. Similar to speech coding, in low bit rate CELP coding of background noise signals, the high band is usually not encoded but generated at the decoder using BWE technology. In this case, the high-band excitation spectrum of the background noise signal can simply be copied from the low-band excitation spectrum, with some random noise added; the high-band spectral energy envelope of the background noise signal can be predicted or estimated from the low-band spectral energy envelope. Controlling the high-band background noise signal when using BWE may differ from controlling speech signals. Unlike for speech signals, the energy of the generated high-band background noise signal preferably stays stable over time to achieve the best perceptual quality.
Figs. 12A and 12B illustrate examples of frequency-domain encoding/decoding with bandwidth extension. Fig. 12A shows the encoder with BWE side information, and Fig. 12B shows the decoder with BWE.
Referring first to Fig. 12A, the low-band signal 1001 is encoded in the frequency domain by using the low-band parameters 1002. The low-band parameters 1002 are quantized, and the quantization indices are transmitted to a receiving audio access device through a bitstream channel 1003. The high-band signal extracted from the audio signal 1004 is encoded with a small number of bits by using the high-band side parameters 1005. The quantized high-band side parameters (HB side information index) are transmitted to the receiving audio access device through a bitstream channel 1006.
Referring to Fig. 12B, at the decoder, the low-band bitstream 1007 is used to produce the decoded low-band signal 1008. The high-band side bitstream 1010 is used to decode and generate the high-band side parameters 1011. The high-band signal 1012 is generated from the low-band signal 1008 with the help of the high-band side parameters 1011. The final audio signal 1009 is produced by combining the low-band signal and the high-band signal. The frequency-domain BWE also needs proper energy control of the generated high-band signal. Different energy levels may be set for unvoiced, voiced and noise signals. So a high-quality classification of the speech signal is also needed for the frequency-domain BWE.
Relevant details of a background noise reduction algorithm are described below. In general, because unvoiced speech signals are noise-like, background noise reduction (NR) should be less aggressive in unvoiced regions than in voiced regions, benefiting from the noise-masking effect. In other words, background noise at the same level is more audible in voiced regions than in unvoiced regions, so NR should be more aggressive in voiced regions than in unvoiced regions. In such cases, a high-quality unvoiced/voiced decision is needed.
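As a simple illustration of this class-dependent behavior, the following sketch selects a noise reduction gain floor per signal class; the floor values are illustrative assumptions.

```python
def nr_gain_floor(signal_class):
    """Lower floor = more aggressive attenuation of the estimated noise."""
    if signal_class == "VOICED":
        return 0.1   # noise is more audible here, so suppress aggressively
    if signal_class == "UNVOICED":
        return 0.5   # noise-like speech masks noise, so suppress gently
    return 0.3       # intermediate floor for other classes
```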
In general, unvoiced speech signals are noise-like signals that have no periodicity. Furthermore, unvoiced speech signals have more energy in the high-frequency region than in the low-frequency region. In contrast, voiced speech signals have the opposite characteristics. For example, a voiced speech signal is a quasi-periodic type of signal, which usually has more energy in the low-frequency region than in the high-frequency region (see also Figs. 9 and 10).
Figs. 13A to 13C are schematic diagrams of speech processing using the various embodiments described above.
Referring to Fig. 13A, a method of speech processing comprises receiving a plurality of frames of a speech signal to be processed (box 1310). In various embodiments, the plurality of frames of the speech signal may be generated within the same audio device, e.g., captured by a microphone. In an alternative embodiment, the speech signal may be received at an audio device, by way of example. For example, the speech signal may subsequently be encoded or decoded. For each frame, an unvoicing/voicing parameter reflecting a characteristic of unvoiced/voiced speech in the current frame is determined (box 1312). In various embodiments, the unvoicing/voicing parameter may comprise a periodicity parameter, a spectral tilt parameter, or another variant. The method further comprises determining a smoothed unvoicing parameter comprising information of the unvoicing/voicing parameter in a previous frame of the speech signal (box 1314). A difference between the unvoicing/voicing parameter and the smoothed unvoicing/voicing parameter is obtained (box 1316). Alternatively, a relative value (e.g., a ratio) between the unvoicing/voicing parameter and the smoothed unvoicing/voicing parameter may be obtained. Using the determined difference as a decision parameter, an unvoiced/voiced decision is made when determining whether the current frame is better suited to be processed as unvoiced or voiced speech (box 1318).
Referring to Fig. 13B, a method of speech processing comprises receiving a plurality of frames of a speech signal (box 1320). The embodiment is described here using a voicing parameter, but an unvoicing parameter may be used equally well. A combined voicing parameter is determined for each frame (box 1322). In one or more embodiments, the combined voicing parameter may be a combination of a periodicity parameter and a tilt parameter, and a smoothed combined voicing parameter is determined as well. The smoothed combined voicing parameter may be obtained by smoothing the combined voicing parameter over one or more previous frames of the speech signal. The combined voicing parameter is compared with the smoothed combined voicing parameter (box 1324). Using the comparison in the decision, the current frame is classified as a VOICED speech signal or an UNVOICED speech signal (box 1326). The speech signal may then be processed, e.g., encoded or decoded, according to the determined classification (box 1328).
Next, referring to Fig. 13C, in another example embodiment, a method of speech processing comprises receiving a plurality of frames of a speech signal (box 1330). A first energy envelope of the speech signal in the time domain is determined (box 1332). The first energy envelope may be determined in a first frequency band, for example, a low band up to 4000 Hz. A smoothed low-band energy may be determined from the first energy envelope using previous frames. A difference or a first ratio between the low-band energy of the speech signal and the smoothed low-band energy is computed (box 1334). A second energy envelope of the speech signal in the time domain is determined (box 1336). The second energy envelope is determined in a second frequency band, which is different from the first frequency band. For example, the second frequency band may be a high band; in one example, the second frequency band may be between 4000 Hz and 8000 Hz. A smoothed high-band energy is computed based on one or more previous frames of the speech signal. A difference or a second ratio is determined for each frame using the second energy envelope (box 1338). The second ratio may be computed as the ratio between the high-band energy of the speech signal in the current frame and the smoothed high-band energy. Using the first ratio and the second ratio in the decision, the current frame is classified as a VOICED speech signal or an UNVOICED speech signal (box 1340). The classified speech signal may then be processed, e.g., encoded or decoded, according to the determined classification (box 1342).
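The following sketch illustrates the two-band energy-envelope decision of Fig. 13C; the 4000 Hz band split is taken from the example above, while the smoothing factor and decision thresholds are illustrative assumptions.

```python
class BandEnergyClassifier:
    def __init__(self, alpha=0.95):
        self.alpha = alpha            # smoothing factor over past frames
        self.lb_energy_sm = None      # smoothed low-band energy (0-4000 Hz)
        self.hb_energy_sm = None      # smoothed high-band energy (4000-8000 Hz)

    def classify(self, lb_energy, hb_energy):
        if self.lb_energy_sm is None:
            self.lb_energy_sm, self.hb_energy_sm = lb_energy, hb_energy
        # Relative values: current-frame energy vs. smoothed energy per band.
        r1 = lb_energy / max(self.lb_energy_sm, 1e-9)   # first ratio (low band)
        r2 = hb_energy / max(self.hb_energy_sm, 1e-9)   # second ratio (high band)
        # Voiced frames stand out in the low band but not in the high band;
        # unvoiced frames show the opposite pattern.
        if r1 > 2.0 and r2 < 2.0:
            decision = "VOICED"
        elif r2 > 2.0 and r1 < 2.0:
            decision = "UNVOICED"
        else:
            decision = "OTHER"
        # Update the smoothed energies with the current frame.
        self.lb_energy_sm = self.alpha * self.lb_energy_sm + (1 - self.alpha) * lb_energy
        self.hb_energy_sm = self.alpha * self.hb_energy_sm + (1 - self.alpha) * hb_energy
        return decision
```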
In one or more embodiments, when the speech signal is determined to be an UNVOICED speech signal, it is encoded/decoded using a noise-like excitation, and when the speech signal is determined to be a VOICED signal, it is encoded/decoded using a pulse-like excitation.
In other embodiments, when the speech signal is determined to be an UNVOICED signal, it is encoded/decoded in the frequency domain, and when the speech signal is determined to be a VOICED signal, it is encoded/decoded in the time domain.
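As a toy illustration of the two excitation types named above (not values or code from the patent), the following sketch builds a noise-like excitation for UNVOICED frames and a pulse-train, pulse-like excitation for VOICED frames; the frame length and pitch lag are placeholders.

```python
import numpy as np


def noise_like_excitation(n: int) -> np.ndarray:
    """Noise-like excitation for a frame classified as UNVOICED."""
    return np.random.randn(n)


def pulse_like_excitation(n: int, pitch_lag: int = 80) -> np.ndarray:
    """Pulse-like excitation for a frame classified as VOICED: a pulse
    train spaced at an assumed pitch lag."""
    e = np.zeros(n)
    e[::pitch_lag] = 1.0
    return e


def excitation_for(frame_len: int, is_unvoiced: bool) -> np.ndarray:
    # The domain split described above would branch the same way:
    # frequency-domain coding for UNVOICED, time-domain coding for VOICED.
    if is_unvoiced:
        return noise_like_excitation(frame_len)
    return pulse_like_excitation(frame_len)


print(excitation_for(160, True)[:4], excitation_for(160, False)[:4])
```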
Accordingly, embodiments of the present invention may be used to improve the unvoiced/voiced decision in speech coding, bandwidth extension, and/or speech enhancement.
Figure 14 illustrates a communication system 10 according to an embodiment of the present invention.
Communication system 10 has audio access devices 7 and 8 coupled to a network 36 via communication links 38 and 40. In one embodiment, audio access devices 7 and 8 are voice over internet protocol (VOIP) devices and network 36 is a wide area network (WAN), a public switched telephone network (PSTN), and/or the internet. In another embodiment, communication links 38 and 40 are wireline and/or wireless broadband connections. In a further alternative embodiment, audio access devices 7 and 8 are cellular or mobile telephones, links 38 and 40 are mobile telephone channels, and network 36 represents a mobile telephone network.
Audio access device 7 uses a microphone 12 to convert sound, such as music or a person's voice, into an analog audio input signal 28. A microphone interface 16 converts analog audio input signal 28 into a digital audio signal 33 for input into an encoder 22 of a codec 20. According to embodiments of the present invention, encoder 22 produces an encoded audio signal TX for transmission to network 36 via a network interface 26. A decoder 24 within codec 20 receives encoded audio signal RX from network 36 via network interface 26 and converts encoded audio signal RX into a digital audio signal 34. A speaker interface 18 converts digital audio signal 34 into an audio signal 30 suitable for driving a loudspeaker 14.
In embodiments of the present invention where audio access device 7 is a VOIP device, some or all of the components within audio access device 7 are implemented within a handset. In some embodiments, however, microphone 12 and loudspeaker 14 are separate units, while microphone interface 16, speaker interface 18, codec 20, and network interface 26 are implemented within a personal computer. Codec 20 can be implemented in software running on a computer or a dedicated processor, or by dedicated hardware, for example, an application specific integrated circuit (ASIC). Microphone interface 16 is implemented by an analog-to-digital (A/D) converter, as well as other interface circuitry located within the handset and/or the computer. Likewise, speaker interface 18 is implemented by a digital-to-analog converter and other interface circuitry located within the handset and/or the computer. In further embodiments, audio access device 7 can be implemented and partitioned in other ways known in the art.
In embodiments of the present invention where audio access device 7 is a cellular or mobile telephone, the elements within audio access device 7 are implemented within a cellular handset. Codec 20 is implemented by software running on a processor within the handset or by dedicated hardware. In further embodiments of the present invention, the audio access device may be implemented in other devices such as peer-to-peer wireline and wireless digital communication systems, for example intercoms and radio handsets. In applications such as consumer audio devices, the audio access device may contain a codec with only an encoder 22 or a decoder 24, for example, in a digital microphone system or a music playback device. In other embodiments of the present invention, codec 20 can be used without microphone 12 and loudspeaker 14, for example, in cellular base stations that access the PSTN.
The speech processing methods for improving the unvoiced/voiced classification described in various embodiments of the present invention may be implemented, for example, in encoder 22 or decoder 24, and may be implemented in hardware or software in various embodiments. For example, encoder 22 or decoder 24 may be part of a digital signal processing (DSP) chip.
Figure 15 illustrates a block diagram of a processing system that may be used for implementing the devices and methods disclosed herein. Specific devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The processing system may comprise a processing unit equipped with one or more input/output devices, such as a speaker, microphone, mouse, touchscreen, keypad, keyboard, printer, display, and the like. The processing unit may include a central processing unit (CPU), memory, a mass storage device, a video adapter, and an I/O interface connected to a bus.
The bus may be one or more of any type of several bus architectures, including a memory bus or memory controller, a peripheral bus, a video bus, or the like. The CPU may comprise any type of electronic data processor. The memory may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory may include ROM for use at boot-up and DRAM for program and data storage for use while executing programs.
The mass storage device may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. The mass storage device may comprise, for example, one or more of a solid state drive, a hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
The video adapter and the I/O interface provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include a display coupled to the video adapter and a mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be utilized. For example, a serial interface such as a universal serial bus (USB) (not shown) may be used to provide an interface for a printer.
The processing unit also includes one or more network interfaces, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or different networks. The network interface allows the processing unit to communicate with remote units via the networks. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit is coupled to a local area network or a wide area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. For example, the various embodiments described above may be combined with one another.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. For example, many of the features and functions discussed above can be implemented in software, hardware, firmware, or a combination thereof. Moreover, the scope of the present invention is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods, and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

Claims (21)

1. A speech processing method, characterized in that the method comprises:
determining an unvoiced/voiced parameter reflecting an unvoiced/voiced speech characteristic in a current frame of a speech signal comprising a plurality of frames;
determining a smoothed unvoiced/voiced parameter comprising information of the unvoiced/voiced parameter in frames prior to the current frame of the speech signal;
computing a difference between the unvoiced/voiced parameter and the smoothed unvoiced/voiced parameter; and
generating an unvoiced/voiced decision using the computed difference as a decision parameter, the unvoiced/voiced decision being used to determine whether the current frame comprises unvoiced speech or voiced speech.
2. The method according to claim 1, characterized in that the unvoiced/voiced parameter is a combined parameter reflecting at least two characteristics of unvoiced/voiced speech.
3. The method according to claim 2, characterized in that the combined parameter is a computed result of a periodicity parameter and a spectral tilt parameter.
4. The method according to claim 1, characterized in that the unvoiced/voiced parameter is an unvoicing parameter (P_unvoicing) reflecting a characteristic of unvoiced speech, and the smoothed unvoiced/voiced parameter is a smoothed unvoicing parameter (P_unvoicing_sm).
5. The method according to claim 4, characterized in that generating the unvoiced/voiced decision comprises determining that the current frame of the speech signal is an unvoiced signal when the difference between the unvoicing parameter and the smoothed unvoicing parameter is greater than 0.1, and determining that the current frame of the speech signal is not unvoiced speech when the difference between the unvoicing parameter and the smoothed unvoicing parameter is less than 0.05.
6. The method according to claim 5, characterized in that generating the unvoiced/voiced decision comprises determining that the current frame of the speech signal has the same speech type as the previous frame when the difference between the unvoicing parameter and the smoothed unvoicing parameter lies between 0.05 and 0.1.
7. The method according to claim 4, characterized in that the smoothed unvoicing parameter is calculated from the unvoicing parameter as follows.
8. The method according to claim 1, characterized in that the unvoiced/voiced parameter is a voicing parameter (P_voicing) reflecting a characteristic of voiced speech, and the smoothed unvoiced/voiced parameter is a smoothed voicing parameter (P_voicing_sm).
9. The method according to claim 8, characterized in that generating the unvoiced/voiced decision comprises determining that the current frame of the speech signal is a voiced signal when the difference between the voicing parameter and the smoothed voicing parameter is greater than 0.1, and determining that the current frame of the speech signal is not voiced speech when the difference between the voicing parameter and the smoothed voicing parameter is less than 0.05.
10. The method according to claim 8, characterized in that the smoothed voicing parameter is calculated from the voicing parameter as follows.
11. The method according to any one of claims 1 to 10, characterized in that determining the unvoiced/voiced parameter reflecting the unvoiced/voiced speech characteristic in the current frame comprises determining a first energy envelope of the speech signal in the time domain in a first frequency band and a second energy envelope of the speech signal in the time domain in a second, different frequency band.
12. The method according to claim 11, characterized in that the second frequency band is higher than the first frequency band.
13. A speech processing apparatus, characterized in that the apparatus comprises:
a processor; and
a computer-readable storage medium storing a program for execution by the processor, the program comprising instructions to:
determine an unvoiced/voiced parameter reflecting an unvoiced/voiced speech characteristic in a current frame of a speech signal comprising a plurality of frames,
determine a smoothed unvoiced/voiced parameter comprising information of the unvoiced/voiced parameter in frames prior to the current frame of the speech signal,
compute a difference between the unvoiced/voiced parameter and the smoothed unvoiced/voiced parameter, and
generate an unvoiced/voiced decision using the computed difference as a decision parameter, the unvoiced/voiced decision being used to determine whether the current frame comprises unvoiced speech or voiced speech.
14. The apparatus according to claim 13, characterized in that the unvoiced/voiced parameter is a combined parameter reflecting a computed result of a periodicity parameter and a spectral tilt parameter.
15. The apparatus according to claim 13, characterized in that generating the unvoiced/voiced decision comprises determining that the current frame of the speech signal is an unvoiced/voiced signal when the difference between the unvoiced/voiced parameter and the smoothed unvoiced/voiced parameter is greater than 0.1, and determining that the current frame of the speech signal is not unvoiced/voiced speech when the difference between the unvoiced/voiced parameter and the smoothed unvoiced/voiced parameter is less than 0.05.
16. The apparatus according to claim 13, characterized in that the unvoiced/voiced parameter is an unvoicing parameter reflecting a characteristic of unvoiced speech, and the smoothed unvoiced/voiced parameter is a smoothed unvoicing parameter.
17. The apparatus according to claim 13, characterized in that the unvoiced/voiced parameter is a voicing parameter reflecting a characteristic of voiced speech, and the smoothed unvoiced/voiced parameter is a smoothed voicing parameter.
18. The apparatus according to any one of claims 13 to 17, characterized in that determining the unvoiced/voiced parameter reflecting the unvoiced/voiced speech characteristic in the current frame comprises determining a first energy envelope of the speech signal in the time domain in a first frequency band and a second energy envelope of the speech signal in the time domain in a second, different frequency band.
19. The apparatus according to claim 18, characterized in that the second frequency band is higher than the first frequency band.
20. A speech processing method, characterized in that the method comprises:
for a current frame of a speech signal, determining a first parameter from a first energy envelope of the speech signal in the time domain in a first frequency band, and determining a second parameter from a second energy envelope of the speech signal in the time domain in a second frequency band;
determining a smoothed first parameter and a smoothed second parameter from frames prior to the current frame of the speech signal;
comparing the first parameter with the smoothed first parameter and comparing the second parameter with the smoothed second parameter; and
generating an unvoiced/voiced decision using the comparison results as decision parameters, the unvoiced/voiced decision being used to determine whether the current frame comprises unvoiced speech or voiced speech.
21. The method according to claim 20, characterized in that the second frequency band is higher than the first frequency band.
CN201480038204.2A 2013-09-09 2014-09-05 Unvoiced/voiced decision method and device for speech processing Active CN105359211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910358523.6A CN110097896B (en) 2013-09-09 2014-09-05 Voiced and unvoiced sound judgment method and device for voice processing

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201361875198P 2013-09-09 2013-09-09
US61/875,198 2013-09-09
US14/476,547 US9570093B2 (en) 2013-09-09 2014-09-03 Unvoiced/voiced decision for speech processing
US14/476,547 2014-09-03
PCT/CN2014/086058 WO2015032351A1 (en) 2013-09-09 2014-09-05 Unvoiced/voiced decision for speech processing

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910358523.6A Division CN110097896B (en) 2013-09-09 2014-09-05 Voiced and unvoiced sound judgment method and device for voice processing

Publications (2)

Publication Number Publication Date
CN105359211A true CN105359211A (en) 2016-02-24
CN105359211B CN105359211B (en) 2019-08-13

Family

ID=52626401

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910358523.6A Active CN110097896B (en) 2013-09-09 2014-09-05 Voiced and unvoiced sound judgment method and device for voice processing
CN201480038204.2A Active CN105359211B (en) 2013-09-09 2014-09-05 The voiceless sound of speech processes/voiced sound decision method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910358523.6A Active CN110097896B (en) 2013-09-09 2014-09-05 Voiced and unvoiced sound judgment method and device for voice processing

Country Status (16)

Country Link
US (4) US9570093B2 (en)
EP (2) EP3005364B1 (en)
JP (2) JP6291053B2 (en)
KR (3) KR101892662B1 (en)
CN (2) CN110097896B (en)
AU (1) AU2014317525B2 (en)
BR (1) BR112016004544B1 (en)
CA (1) CA2918345C (en)
ES (2) ES2908183T3 (en)
HK (1) HK1216450A1 (en)
MX (1) MX352154B (en)
MY (1) MY185546A (en)
RU (1) RU2636685C2 (en)
SG (2) SG10201701527SA (en)
WO (1) WO2015032351A1 (en)
ZA (1) ZA201600234B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9570093B2 (en) * 2013-09-09 2017-02-14 Huawei Technologies Co., Ltd. Unvoiced/voiced decision for speech processing
US9972334B2 (en) 2015-09-10 2018-05-15 Qualcomm Incorporated Decoder audio classification
WO2017196422A1 (en) * 2016-05-12 2017-11-16 Nuance Communications, Inc. Voice activity detection feature based on modulation-phase differences
US10249305B2 (en) * 2016-05-19 2019-04-02 Microsoft Technology Licensing, Llc Permutation invariant training for talker-independent multi-talker speech separation
RU2668407C1 (en) * 2017-11-07 2018-09-28 Акционерное общество "Концерн "Созвездие" Method of separation of speech and pause by comparative analysis of interference power values and signal-interference mixture
CN108447506A (en) * 2018-03-06 2018-08-24 深圳市沃特沃德股份有限公司 Method of speech processing and voice processing apparatus
US10957337B2 (en) 2018-04-11 2021-03-23 Microsoft Technology Licensing, Llc Multi-microphone speech separation
WO2021156375A1 (en) * 2020-02-04 2021-08-12 Gn Hearing A/S A method of detecting speech and speech detector for low signal-to-noise ratios

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5216747A (en) * 1990-09-20 1993-06-01 Digital Voice Systems, Inc. Voiced/unvoiced estimation of an acoustic signal
US5765127A (en) * 1992-03-18 1998-06-09 Sony Corp High efficiency encoding method
JPH06110489A (en) * 1992-09-24 1994-04-22 Nitsuko Corp Device and method for speech signal processing
EP0642251B1 (en) * 1993-09-02 2006-10-18 Infineon Technologies AG Method for the automatic switching of the speech direction and circuit arrangement for implementing the method
JPH07212296A (en) * 1994-01-17 1995-08-11 Japan Radio Co Ltd Vox control communication equipment
US5991725A (en) * 1995-03-07 1999-11-23 Advanced Micro Devices, Inc. System and method for enhanced speech quality in voice storage and retrieval systems
WO1998001847A1 (en) * 1996-07-03 1998-01-15 British Telecommunications Public Limited Company Voice activity detector
TW430778B (en) * 1998-06-15 2001-04-21 Yamaha Corp Voice converter with extraction and modification of attribute data
US6463407B2 (en) * 1998-11-13 2002-10-08 Qualcomm Inc. Low bit-rate coding of unvoiced segments of speech
US6556967B1 (en) * 1999-03-12 2003-04-29 The United States Of America As Represented By The National Security Agency Voice activity detector
US6415029B1 (en) * 1999-05-24 2002-07-02 Motorola, Inc. Echo canceler and double-talk detector for use in a communications unit
JP3454214B2 (en) * 1999-12-22 2003-10-06 三菱電機株式会社 Pulse noise removing apparatus and medium-wave AM broadcast receiver including the same
JP3689616B2 (en) * 2000-04-27 2005-08-31 シャープ株式会社 Voice recognition apparatus, voice recognition method, voice recognition system, and program recording medium
US6640208B1 (en) * 2000-09-12 2003-10-28 Motorola, Inc. Voiced/unvoiced speech classifier
US7171357B2 (en) * 2001-03-21 2007-01-30 Avaya Technology Corp. Voice-activity detection using energy ratios and periodicity
US7519530B2 (en) * 2003-01-09 2009-04-14 Nokia Corporation Audio signal processing
US7698141B2 (en) * 2003-02-28 2010-04-13 Palo Alto Research Center Incorporated Methods, apparatus, and products for automatically managing conversational floors in computer-mediated communications
US7469209B2 (en) * 2003-08-14 2008-12-23 Dilithium Networks Pty Ltd. Method and apparatus for frame classification and rate determination in voice transcoders for telecommunications
KR101008022B1 (en) * 2004-02-10 2011-01-14 삼성전자주식회사 Voiced sound and unvoiced sound detection method and apparatus
JP2007149193A (en) * 2005-11-25 2007-06-14 Toshiba Corp Defect signal generating circuit
JP2007292940A (en) * 2006-04-24 2007-11-08 Toyota Motor Corp Voice recognition device and voice recognition method
US8725499B2 (en) * 2006-07-31 2014-05-13 Qualcomm Incorporated Systems, methods, and apparatus for signal change detection
BRPI0717484B1 (en) * 2006-10-20 2019-05-21 Dolby Laboratories Licensing Corporation METHOD AND APPARATUS FOR PROCESSING AN AUDIO SIGNAL
US7817286B2 (en) * 2006-12-22 2010-10-19 Hitachi Global Storage Technologies Netherlands B.V. Iteration method to improve the fly height measurement accuracy by optical interference method and theoretical pitch and roll effect
US7873114B2 (en) * 2007-03-29 2011-01-18 Motorola Mobility, Inc. Method and apparatus for quickly detecting a presence of abrupt noise and updating a noise estimate
US8990073B2 (en) * 2007-06-22 2015-03-24 Voiceage Corporation Method and device for sound activity detection and sound signal classification
CN101221757B (en) 2008-01-24 2012-02-29 中兴通讯股份有限公司 High-frequency cacophony processing method and analyzing method
US8321214B2 (en) * 2008-06-02 2012-11-27 Qualcomm Incorporated Systems, methods, and apparatus for multichannel signal amplitude balancing
US20110123121A1 (en) * 2009-10-13 2011-05-26 Sony Corporation Method and system for reducing blocking artefacts in compressed images and video signals
CN102884575A (en) * 2010-04-22 2013-01-16 高通股份有限公司 Voice activity detection
TWI403304B (en) * 2010-08-27 2013-08-01 Ind Tech Res Inst Method and mobile device for awareness of linguistic ability
CN102655480B (en) 2011-03-03 2015-12-02 腾讯科技(深圳)有限公司 Similar mail treatment system and method
US8909539B2 (en) 2011-12-07 2014-12-09 Gwangju Institute Of Science And Technology Method and device for extending bandwidth of speech signal
KR101352608B1 (en) * 2011-12-07 2014-01-17 광주과학기술원 A method for extending bandwidth of vocal signal and an apparatus using it
US20130151125A1 (en) * 2011-12-08 2013-06-13 Scott K. Mann Apparatus and Method for Controlling Emissions in an Internal Combustion Engine
KR101398189B1 (en) * 2012-03-27 2014-05-22 광주과학기술원 Speech receiving apparatus, and speech receiving method
US8924209B2 (en) * 2012-09-12 2014-12-30 Zanavox Identifying spoken commands by templates of ordered voiced and unvoiced sound intervals
US9984706B2 (en) * 2013-08-01 2018-05-29 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
US9570093B2 (en) * 2013-09-09 2017-02-14 Huawei Technologies Co., Ltd. Unvoiced/voiced decision for speech processing

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6453285B1 (en) * 1998-08-21 2002-09-17 Polycom, Inc. Speech activity detector for use in noise reduction system, and methods therefor
CN1470052A * 2000-10-18 2004-01-21 High frequency intensifier coding for bandwidth expansion speech coder and decoder
US7606703B2 (en) * 2000-11-15 2009-10-20 Texas Instruments Incorporated Layered celp system and method with varying perceptual filter or short-term postfilter strengths
US20050177364A1 (en) * 2002-10-11 2005-08-11 Nokia Corporation Methods and devices for source controlled variable bit-rate wideband speech coding
CN1703737A (en) * 2002-10-11 2005-11-30 诺基亚有限公司 Method for interoperation between adaptive multi-rate wideband (AMR-WB) and multi-mode variable bit-rate wideband (VMR-WB) codecs
CN1909060A (en) * 2005-08-01 2007-02-07 三星电子株式会社 Method and apparatus for extracting voiced/unvoiced classification information
CN101379551A (en) * 2005-12-28 2009-03-04 沃伊斯亚吉公司 Method and device for efficient frame erasure concealment in speech codecs
US20110313778A1 (en) * 2006-06-21 2011-12-22 Samsung Electronics Co., Ltd Method and apparatus for adaptively encoding and decoding high frequency band
WO2008151408A1 (en) * 2007-06-14 2008-12-18 Voiceage Corporation Device and method for frame erasure concealment in a pcm codec interoperable with the itu-t recommendation g.711
CN101261836A (en) * 2008-04-25 2008-09-10 清华大学 Method for enhancing excitation signal naturalism based on judgment and processing of transition frames
CN102664003A (en) * 2012-04-24 2012-09-12 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Henning Puder et al., "An Approach to an Optimized Voice-Activity Detector for Noisy Speech Signals," Signal Processing. *

Also Published As

Publication number Publication date
KR102007972B1 (en) 2019-08-06
EP3352169A1 (en) 2018-07-25
RU2636685C2 (en) 2017-11-27
SG10201701527SA (en) 2017-03-30
AU2014317525B2 (en) 2017-05-04
ES2687249T3 (en) 2018-10-24
KR20180095744A (en) 2018-08-27
KR101892662B1 (en) 2018-08-28
CA2918345C (en) 2021-11-23
MY185546A (en) 2021-05-19
SG11201600074VA (en) 2016-02-26
BR112016004544B1 (en) 2022-07-12
ZA201600234B (en) 2017-08-30
CN105359211B (en) 2019-08-13
RU2016106637A (en) 2017-10-16
EP3352169B1 (en) 2021-12-08
US11328739B2 (en) 2022-05-10
JP6470857B2 (en) 2019-02-13
EP3005364A4 (en) 2016-06-01
CA2918345A1 (en) 2015-03-12
KR20170102387A (en) 2017-09-08
CN110097896B (en) 2021-08-13
HK1216450A1 (en) 2016-11-11
JP2018077546A (en) 2018-05-17
US9570093B2 (en) 2017-02-14
ES2908183T3 (en) 2022-04-28
MX2016002561A (en) 2016-06-17
JP2016527570A (en) 2016-09-08
US20150073783A1 (en) 2015-03-12
US20180322895A1 (en) 2018-11-08
US20200005812A1 (en) 2020-01-02
US10043539B2 (en) 2018-08-07
MX352154B (en) 2017-11-10
JP6291053B2 (en) 2018-03-14
WO2015032351A1 (en) 2015-03-12
US10347275B2 (en) 2019-07-09
CN110097896A (en) 2019-08-06
AU2014317525A1 (en) 2016-02-11
BR112016004544A2 (en) 2017-08-01
KR101774541B1 (en) 2017-09-04
US20170110145A1 (en) 2017-04-20
EP3005364B1 (en) 2018-07-11
KR20160025029A (en) 2016-03-07
EP3005364A1 (en) 2016-04-13

Similar Documents

Publication Publication Date Title
US10249313B2 (en) Adaptive bandwidth extension and apparatus for the same
US9837092B2 (en) Classification between time-domain coding and frequency domain coding
US11328739B2 Unvoiced/voiced decision for speech processing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant