CN108231083A

CN108231083A - Method for improving the coding efficiency of a SILK-based speech coder

Info

Publication number: CN108231083A
Application number: CN201810040152.2A
Authority: CN
Inventors: 李强; 张玲; 明艳; 王怡曼
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Chongqing University of Posts and Telecommunications
Priority date: 2018-01-16
Filing date: 2018-01-16
Publication date: 2018-06-29

Abstract

The present invention proposes a method for improving the coding efficiency of a SILK-based speech coder. The method works as follows. First, specific noise is added to the input speech signal to generate an analog (simulated) signal; long-term and short-term prediction are then performed on this analog signal, which raises the prediction gain of the prediction filters and lowers the entropy of the quantization indices, thereby improving coding efficiency. Second, the encoder determines the excitation signal by minimizing the perceptually weighted reconstruction error, whereas the decoder uses post-filtering to compress the spectral regions in which the quantization noise is highly correlated with the signal; by applying different weighting filters to the input of the noise shaping quantizer and to the reconstructed signal, both functions are combined in the encoder-side quantizer. The proposed method needs no side information, requires no change to the bitstream format, and improves the coding efficiency of SILK.

Description

A method for improving the coding efficiency of a SILK-based speech coder

Technical field

The invention belongs to the field of voice communication and relates in particular to a SILK-based wideband speech coder, which is widely applicable to real-time voice communication scenarios such as teleconferencing, Voice over Internet Protocol (VoIP) telephony, wireless communication, and gaming platforms.

Background art

Speech is the most direct, convenient, and efficient medium by which humans convey information, so transmitting speech signals is a basic function of most communication systems. As technology has developed, non-speech information such as images and text has taken up a growing share of transmitted information, but transmitting speech efficiently remains an essential function of many communication systems.

In a digital communication system the original speech signal must be digitized before it can be transmitted, but analog-to-digital conversion increases the data volume: sampling speech at 16 kHz with 16-bit uniform quantization, for example, yields a bit rate of 256 kbps. Such high-rate digital speech needs more bandwidth in the network, which raises transmission cost in band-limited systems such as cellular mobile networks. The digitized speech signal therefore has to be compressed.
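The 256 kbps figure follows directly from the sampling rate and the sample width. A minimal sketch of the arithmetic, where the function name `pcm_bit_rate_kbps` is an illustrative assumption:

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bit rate of uncompressed, uniformly quantized PCM in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


# 16 kHz sampling with 16-bit uniform quantization, as in the text above.
wideband_kbps = pcm_bit_rate_kbps(16_000, 16)
print(wideband_kbps)
```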

In 1972 the International Telegraph and Telephone Consultative Committee (CCITT) published the 64 kbps speech coding standard G.711, which uses pulse code modulation (PCM) and is applied in telephone services. In 1984 CCITT adopted the 32 kbps adaptive differential pulse code modulation (ADPCM) speech coding standard G.721. Analysis-by-synthesis speech coding algorithms subsequently became mainstream: in 1992 CCITT published the 16 kbps low-delay code-excited linear prediction (LD-CELP) speech coding scheme G.728, and in 1996 the 8 kbps conjugate-structure algebraic code-excited linear prediction (CS-ACELP) speech coding standard G.729 was adopted, which can be applied to VoIP, H.323, and other voice communication fields. As network bandwidth keeps growing and terminal processing power keeps increasing, users demand ever higher speech quality, and wideband, super-wideband, and fullband speech coding techniques have been widely studied and applied.

Traditional narrowband speech coding standards generally limit the speech bandwidth to 300 Hz to 3400 Hz, with a sampling frequency of 8 kHz. This bandwidth limit reduces the naturalness of speech, makes some speech processing results unsatisfactory, and constrains further improvement of coding quality. To achieve high-quality voice communication, wideband speech coding with a band of 50 Hz to 7000 Hz was introduced. Relative to narrowband speech, the low-frequency extension from 50 Hz to 300 Hz improves naturalness, presence, and comfort, while the high-frequency extension from 3400 Hz to 7000 Hz makes fricatives and plosives easier to distinguish and so improves intelligibility. Many research institutions and organizations, in China and abroad, have therefore long worked on wideband speech coding algorithms and standards. Several wideband speech coding standards have been established so far: ITU-T G.722, ITU-T G.722.1, ITU-T G.722.2, and the 3GPP2 Variable-Rate Multimode Wideband (VMR-WB) codec. In 2003, 3GPP2 selected VMR-WB as the wideband speech codec for CDMA2000 systems. ITU-T later introduced several new embedded wideband speech coding standards, ITU-T G.729.1, ITU-T G.711.1, and ITU-T G.718, of which G.729.1, established in 2006, is the most representative. G.729.1 extends G.729 to wideband (bandwidth extended to 50 Hz to 7000 Hz). In March 2008 ITU-T issued the embedded wideband speech and audio coding standard G.711.1, with bit rates of 64 kbps, 80 kbps, and 96 kbps, among others. G.718, standardized by ITU-T in June 2008, is a narrowband/wideband embedded variable-rate speech and audio coder that is robust to frame erasures; it offers five bit rates, 8 kbps, 12 kbps, 16 kbps, 24 kbps, and 32 kbps, of which only 8 kbps and 12 kbps are supported for narrowband coding, while all five are supported for wideband coding. Early wideband multi-rate speech coders were used mainly in video conferencing, whereas they are now used mainly in VoIP and wireless applications.

With the development of Internet technology and the spread of its applications, low-cost Internet telephony has been studied in depth, and many standardization and industry bodies have proposed corresponding speech coding schemes. These include G.711, G.723.1, and G.729A from the International Telecommunication Union, and speech coding algorithms such as iLBC and SILK proposed by companies such as GIPS and Skype. SILK is a speech codec developed in-house by Skype; it supports sampling frequencies of 8, 12, 16, and 24 kHz and bit rates of 6 to 40 kbps. The coder provides real-time scalability to adapt to changes in network quality, can deliver super-wideband audio while occupying 50% less network capacity than before, and maintains stable call quality even at high packet loss rates. Because it offers better speech quality in low-bandwidth environments, SILK's application prospects have attracted wide attention, and research on the key algorithms of the SILK coder and on further performance improvements has become a goal pursued by many researchers. Designing a high-quality, high-efficiency SILK-based speech coder and applying it to real-time voice communication scenarios such as teleconferencing, VoIP, wireless communication, and gaming platforms therefore has significant research and practical value.

SILK supports redundant coding and multi-frame packetization. Although these mechanisms improve SILK's error resilience, redundant coding increases the bit rate and thus reduces SILK's coding efficiency. The aim is therefore to improve coding efficiency without reducing coding quality.

Summary of the invention

In view of the deficiencies of the prior art, a SILK-based speech coder with higher coding efficiency and better coding quality is proposed. The technical solution is as follows. It comprises the encoding steps at the encoder and the decoding steps at the decoder, and the method for improving the coding efficiency of the SILK-based speech coder proceeds as follows:

101. Voice activity detection (VAD) is first performed on the input speech signal to detect pauses, silent intervals, and active speech; meanwhile, the speech signal is passed through a high-pass filter with a cutoff frequency of 70 Hz to remove any DC offset and 50 Hz or 60 Hz hum;

102. Pitch analysis is then performed on the speech signal: SILK uses open-loop pitch analysis to make a voiced/unvoiced decision and estimates the pitch period of voiced signals, yielding the pitch autocorrelation coefficients and the pitch lag;

103. Noise shaping analysis (NSA) is performed on the high-pass filter output; NSA yields the gains and filter coefficients used in the prefilter and in the noise shaping quantizer;

104. The signals obtained from pitch analysis and NSA are fed to the analog signal generation module; at the same time, long-term prediction (LTP) analysis is performed on the pitch analysis output, and the NSA output is prefiltered;

105. Further prediction analysis is performed on the generated analog signal and the high-pass filtered signal; the result is converted to line spectral frequency (LSF) parameters, which are quantized with multi-stage vector quantization to extract the characteristic parameters; the quantized parameters are then converted back to linear predictive coding (LPC) parameters, and this conversion keeps the encoder and decoder synchronized;

106. Noise shaping quantization (NSQ) is performed on the basis of step 105; noise shaping makes the noise spectrum follow the spectral changes of the signal, so that the noise is harder to hear;

107. The extracted speech characteristic parameters are range coded, completing the encoding process.
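The high-pass filtering in step 101 can be illustrated with a first-order filter. This is a minimal sketch under stated assumptions, not the filter SILK actually uses: a first-order RC design is assumed for brevity, the name `highpass` is illustrative, and the 16 kHz sampling rate is only a default.

```python
import math


def highpass(x, fs_hz=16000.0, cutoff_hz=70.0):
    """First-order RC high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / fs_hz)
    y = []
    prev_x = 0.0
    prev_y = 0.0
    for sample in x:
        prev_y = a * (prev_y + sample - prev_x)
        prev_x = sample
        y.append(prev_y)
    return y


# A constant (DC) input decays towards zero at the output.
dc_out = highpass([1.0] * 16000)
print(abs(dc_out[-1]) < 1e-6)
```

A first-order section removes the DC offset completely but only attenuates 50 Hz or 60 Hz hum; a steeper, higher-order design would be needed to suppress hum strongly.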

Further, the analog signal generation module in step 104 encodes the speech signal using a time-varying source-filter model, and the coder consists of the following parts:

The input, which consists of a speech signal made up of a series of consecutive frames;

A first signal processing module, which generates an analog signal for each speech frame in the series of consecutive frames by adding a specific noise signal to the input speech frame.

A second signal processing module, which determines the LPC coefficients from the analog signal frame and then determines the LPC residual signal from those LPC coefficients and the input signal;

A third signal processing module, which generates the encoded signal representing the speech signal by arithmetically coding the LPC coefficients and the LPC residual signal.

The analog signal is generated as follows:

A1: The input speech signal is added to the output of the noise shaping filter to form the first input of the analog output signal, where the noise shaping filter consists of a long-term shaping filter and a short-term shaping filter;

A2: White noise and the quantization gain obtained from noise shaping analysis form the second input of the analog output signal, where the white noise has the same variance as the quantization noise;

A3: The two signals obtained in steps A1 and A2 are added to give the final analog output signal, which completes the generation of the analog signal in step 104;
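Steps A1 to A3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the "analog signal" is read as a simulated signal, the noise shaping filter output is passed in as a precomputed sequence (`shaping_out`, a hypothetical name), the white noise is assumed to be scaled by the quantization gain, and its standard deviation defaults to sqrt(1/12), the standard deviation of uniform quantization noise for a unit step size.

```python
import math
import random


def generate_analog_signal(speech, shaping_out, quant_gain,
                           noise_std=math.sqrt(1.0 / 12.0), seed=0):
    """Return speech + shaping filter output + gain-scaled white noise."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    out = []
    for s, f in zip(speech, shaping_out):
        first = s + f                                     # A1: speech plus shaping output
        second = quant_gain * rng.gauss(0.0, noise_std)   # A2: assumed noise * gain
        out.append(first + second)                        # A3: sum of both inputs
    return out


frame = [0.1 * n for n in range(5)]
print(generate_analog_signal(frame, [0.0] * 5, quant_gain=0.0))
```

With `quant_gain=0.0` the noise term vanishes and the output reduces to the sum in step A1, which makes the structure easy to check.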

Further, in step 106 the noise shaping quantizer shapes the spectra of the signal and of the coding noise independently, which yields higher speech quality at the same bit rate. The prefilter output signal is first multiplied by a compensation gain G calculated during NSA, then added to the output of the synthesis shaping filter, and the output of a prediction filter is subtracted, giving a residual signal. The residual signal is multiplied by the quantization gain obtained from NSA, and the result, together with the specific noise generated in step 104, is fed to a lattice quantizer. The quantization indices of the quantizer form the excitation indices that are passed to the range coder. The output of the prediction filter is added to the excitation signal to give the quantized output signal, which in turn serves as the input to the synthesis shaping and prediction filters. Unlike a classical NSQ, the noise shaping in the NSQ of the present invention is applied directly around the quantizer and fed back to its input: the input and output speech signals are compared, and the result is returned to the quantizer input.
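The quantization loop described above can be sketched with one-tap filters. This is a simplified illustration under stated assumptions, not the coder's actual NSQ: the synthesis shaping filter and the prediction filter are each reduced to a single hypothetical coefficient (`shape_coef`, `pred_coef`), plain rounding stands in for the quantizer, the residual is scaled by the reciprocal of the quantization gain before rounding and scaled back afterwards (one reading of "multiplied by the quantization gain"), and the specific noise is modelled as an optional additive sequence.

```python
import math


def noise_shaping_quantize(prefiltered, comp_gain, quant_gain,
                           shape_coef, pred_coef, noise=None):
    """One-tap sketch of the NSQ loop; returns (indices, quantized output)."""
    indices, output = [], []
    y_prev = 0.0
    for n, x in enumerate(prefiltered):
        shaped = shape_coef * y_prev       # synthesis shaping filter output
        predicted = pred_coef * y_prev     # prediction filter output
        residual = comp_gain * x + shaped - predicted
        extra = noise[n] if noise is not None else 0.0
        # Assumption: scale by 1/quant_gain, then round (stand-in quantizer).
        index = round(residual / quant_gain + extra)
        excitation = quant_gain * index    # dequantized excitation signal
        y = predicted + excitation         # quantized output signal
        indices.append(index)
        output.append(y)
        y_prev = y                         # fed back to shaping and prediction
    return indices, output


x = [math.sin(0.05 * n) for n in range(200)]
idx, y = noise_shaping_quantize(x, comp_gain=1.0, quant_gain=0.25,
                                shape_coef=0.5, pred_coef=0.8)
print(idx[:10])
```

In this sketch the output satisfies y(n) = G·x(n) + shape_coef·y(n-1) + q(n), where q(n) is the rounding error and is bounded by half the quantization gain when no extra noise is supplied.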

Advantages and beneficial effects of the invention:

With the method of the present invention for improving the coding efficiency of a SILK-based speech coder, the coding bit rate can be effectively reduced without affecting coding quality, giving a SILK speech coder with high coding efficiency and high quality. It can be applied in real-time voice communication scenarios such as teleconferencing, VoIP, wireless communication, and gaming platforms, so the invention has good application prospects and practical value.

Description of the drawings

Fig. 1 is a flow chart of SILK speech encoding according to an embodiment of the invention.

Fig. 2 is a schematic diagram of the analog signal generation module of the invention.

Fig. 3 is a flow chart of high-efficiency SILK speech encoding according to an embodiment of the invention.

Fig. 4 is a functional block diagram of noise shaping quantization according to the invention.

Fig. 5 is a flow chart of SILK speech decoding according to an embodiment of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawings:

The block diagram of SILK speech encoding is shown in Fig. 1. The coder as a whole follows the classical source-filter model, that is, it is based on a model of the speech production system, and uses two stages of filtering. The first stage, the long-term prediction filter, removes the periodic component of voiced speech; unvoiced speech needs no LTP processing. The second stage is short-term filtering, which removes the redundancy between adjacent samples; here the LPC coefficients are computed with Burg's algorithm and then quantized with multi-stage vector quantization. These two filtering stages yield the excitation signal, which then undergoes gain quantization, NSQ, and normalization; the normalized signal is range coded. The implementation steps are as follows:
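The short-term stage can be illustrated with Burg's algorithm, a lattice method for estimating LPC coefficients (assumed here to be the lattice algorithm the text refers to). The following is a textbook floating-point sketch, not SILK's implementation; the function name `burg_lpc` and the synthetic test signal are illustrative assumptions.

```python
import random


def burg_lpc(x, order):
    """Burg's method. Returns (coefs, err) with x_hat[n] = sum(coefs[j] * x[n-1-j])."""
    n = len(x)
    f = list(x)                      # forward prediction error
    b = list(x)                      # backward prediction error
    a = [1.0]                        # error filter A(z) = 1 + a1*z^-1 + ...
    err = sum(v * v for v in x) / n
    for m in range(order):
        num = 0.0
        den = 0.0
        for i in range(m + 1, n):
            num += f[i] * b[i - 1]
            den += f[i] * f[i] + b[i - 1] * b[i - 1]
        k = -2.0 * num / den         # reflection coefficient of this lattice stage
        a = [1.0] + [a[j] + k * a[m + 1 - j] for j in range(1, m + 1)] + [k]
        for i in range(n - 1, m, -1):    # descending, so b[i-1] is still the old value
            f_new = f[i] + k * b[i - 1]
            b[i] = b[i - 1] + k * f[i]
            f[i] = f_new
        err *= 1.0 - k * k
    return [-c for c in a[1:]], err


# Synthetic first-order autoregressive signal with coefficient 0.9.
rng = random.Random(0)
sig = [0.0]
for _ in range(20000):
    sig.append(0.9 * sig[-1] + rng.gauss(0.0, 1.0))
coefs, err = burg_lpc(sig[1:], 2)
print(coefs)
```

On this test signal the first coefficient should come out close to 0.9 and the second close to zero, since the signal has only first-order correlation.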

Step 1: VAD is first performed on the input speech signal to detect pauses, silent intervals, and active speech; meanwhile, the speech signal is passed through a high-pass filter with a cutoff frequency of 70 Hz to remove any DC offset and 50 Hz or 60 Hz hum;

Step 2: Pitch analysis is then performed on the speech signal: SILK uses open-loop pitch analysis to make a voiced/unvoiced decision and estimates the pitch period of voiced signals, yielding the pitch autocorrelation coefficients and the pitch lag;

Step 3: Noise shaping analysis (NSA) is performed on the high-pass filter output; NSA yields the gains and filter coefficients used in the prefilter and in the noise shaping quantizer;

Step 4: The signals obtained from pitch analysis and NSA are fed to the analog signal generation module; at the same time, long-term prediction analysis is performed on the pitch analysis output, and the NSA output is prefiltered;

Step 5: Further prediction analysis is performed on the generated analog signal and the high-pass filtered signal; the result is converted to LSF parameters, which are quantized with multi-stage vector quantization to extract the characteristic parameters; the quantized parameters are then converted back to linear prediction parameters, and this conversion keeps the encoder and decoder synchronized;

Step 6: Noise shaping quantization is performed on the basis of Step 5; noise shaping makes the noise spectrum follow the spectral changes of the signal, so that the noise is harder to hear;

Step 7: The extracted speech characteristic parameters are range coded, completing the encoding process.
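The open-loop pitch analysis of Step 2 can be illustrated with a normalized autocorrelation search. This is a minimal sketch under stated assumptions, not SILK's pitch estimator: the search range, the voicing threshold, and the name `estimate_pitch` are all illustrative choices.

```python
import math


def estimate_pitch(frame, fs_hz=16000, f_min_hz=60.0, f_max_hz=400.0,
                   voicing_threshold=0.5):
    """Return (voiced, pitch_lag, correlation) from a normalized autocorrelation search."""
    lag_min = int(fs_hz / f_max_hz)
    lag_max = int(fs_hz / f_min_hz)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        cur = frame[lag:]            # samples x[n]
        past = frame[:-lag]          # samples x[n - lag]
        num = sum(p * q for p, q in zip(cur, past))
        den = math.sqrt(sum(p * p for p in cur) * sum(q * q for q in past))
        corr = num / den if den > 0.0 else 0.0
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_corr >= voicing_threshold, best_lag, best_corr


# 40 ms test frame: 100 Hz fundamental plus one harmonic, sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 100 * n / 16000)
         + 0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(640)]
voiced, lag, corr = estimate_pitch(frame)
print(voiced, lag)
```

The frame must be longer than the largest lag searched. Practical estimators add safeguards against pitch doubling and halving, which this sketch omits.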

Fig. 2 shows a specific way of improving coding efficiency. The encoder generates an analog (that is, simulated) signal that matches the spectral characteristics of the input and uses it in place of the original input signal. Combined with long-term and short-term prediction on the analog signal, this raises the prediction gain of the prediction filters and lowers the entropy of the quantization indices, which reduces the bit rate needed to transmit the encoded speech signal and improves the coding efficiency of the coder.

The speech signal is encoded using a time-varying source-filter model, and the coder consists of the following parts:

The input, which consists of a speech signal made up of a series of consecutive frames;

The analog signal is generated as follows:

S1: The input speech signal is added to the output of the noise shaping filter to form the first input of the analog output signal, where the noise shaping filter consists of a long-term shaping filter and a short-term shaping filter;

S2: White noise and the quantization gain obtained from noise shaping analysis form the second input of the analog output signal, where the white noise has the same variance as the quantization noise;

S3: The two signals obtained in steps S1 and S2 are added to give the final analog output signal, which completes the generation of the analog signal in Step 4;

By integrating the analog signal generation module into the SILK speech coder appropriately, and using the quantization noise obtained in Step 6 as an input to the NSQ, the high-efficiency SILK speech coder shown in Fig. 3 is obtained. The analog signal replaces the original input signal; combined with long-term and short-term prediction on the analog signal, this raises the prediction gain of the prediction filters and lowers the entropy of the quantization indices, which reduces the bit rate needed to transmit the encoded speech signal and improves the coding efficiency of the coder.
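The link between prediction gain and the entropy of the quantization indices can be illustrated numerically. The sketch below is only an illustration, not the coder itself: it uses a synthetic first-order autoregressive signal, a fixed open-loop predictor, and plain rounding as the quantizer, all of which are assumptions chosen for brevity.

```python
import math
import random
from collections import Counter


def entropy_bits(symbols):
    """Empirical first-order entropy of a symbol sequence, in bits per symbol."""
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in Counter(symbols).values())


def prediction_gain_db(signal, residual):
    """Ratio of signal energy to prediction residual energy, in dB."""
    return 10.0 * math.log10(sum(s * s for s in signal) / sum(r * r for r in residual))


rng = random.Random(0)
x = [0.0]
for _ in range(20000):
    x.append(0.95 * x[-1] + rng.gauss(0.0, 1.0))   # strongly correlated test signal
x = x[1:]
residual = [x[0]] + [x[n] - 0.95 * x[n - 1] for n in range(1, len(x))]

step = 0.5                                          # same quantizer step for both cases
h_raw = entropy_bits([round(s / step) for s in x])
h_res = entropy_bits([round(r / step) for r in residual])
gain_db = prediction_gain_db(x, residual)
print(gain_db, h_raw, h_res)
```

With the same step size, the indices of the prediction residual have lower entropy than the indices of the raw signal, so an entropy coder such as a range coder needs fewer bits per sample.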

The NSQ module quantizes the residual signal and also generates the excitation signal. The encoder determines the excitation signal by minimizing the perceptually weighted reconstruction error, while the decoder uses post-filtering to compress the spectral regions in which the quantization noise is highly correlated with the signal. The NSQ of the present invention applies different weighting filters to the input and to the reconstructed signal, so that both functions are combined in the encoder's quantizer. Integrating the two operations at the encoder not only simplifies the decoder but also lets the encoder use an arbitrarily simple or complex perceptual model to shape the quantization noise and to boost or suppress spectral regions, either jointly or independently. With this model, no side information is needed and the bitstream format is unchanged. Fig. 4 is the functional block diagram of noise shaping quantization according to an embodiment of the invention; the prediction filter in the figure comprises both the LPC and the LTP prediction filters. F_ana and F_syn are the analysis and synthesis noise shaping filters, respectively; for voiced frames both contain long-term and short-term filters. The quantized excitation indices are denoted i(n). The LTP coefficients, gains, and noise shaping coefficients are updated once per subframe, whereas the LPC coefficients are updated once per frame. The output of the NSQ quantizer is given by formula (1):
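Formula (1) is not reproduced in this text. A form consistent with the definitions above, and with the two parts described in the sentence that follows, is sketched here as an assumption; X(z), Y(z), and Q(z), denoting the z-transforms of the NSQ input, the quantized output, and the quantization noise, and G, the compensation gain, are used for illustration:

```latex
Y(z) = G \cdot \frac{1 - F_{ana}(z)}{1 - F_{syn}(z)} \cdot X(z) + \frac{1}{1 - F_{syn}(z)} \cdot Q(z) \tag{1}
```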

The first part of formula (1) is the input signal shaping part, and the second part is the quantization noise shaping part.

Fig. 5 is the flow chart of SILK speech decoding according to an embodiment of the invention.

At the receiver, the received data packet is split by a variable-length decoder into the frames it contains. Each frame carries the information needed to reconstruct a 20 ms frame of the output signal.

Step 1: Range decoding. This module decodes the speech characteristic parameters from the received bitstream. Its output includes the pulses and gains used to generate the excitation signal, together with the LTP and LSF codebook indices; these indices are used to decode the LTP and LPC coefficients, which are then used for LTP and LPC synthesis filtering of the excitation signal;

Step 2: Parameter decoding. The pulses and gains are obtained from the decoding in Step 1. If the decoded speech frame is voiced, the LTP target codebook and index are decoded, and the LTP coefficients are decoded from the LTP target codebook; this is done in the same way for all four subframes of each frame. The LPC coefficients are obtained by decoding the LSF codebooks, with one vector taken from each stage of the multi-stage codebook;

Step 3: Excitation generation. The pulse signal is multiplied by the quantization gain to obtain the excitation signal;

Step 4: LTP synthesis. For voiced speech, the excitation signal e(n) is the input to the LTP synthesis filter, which restores the long-term correlation removed by the LTP analysis filter and generates an LPC excitation signal e_LPC(n) according to formula (2);
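Formula (2) is not reproduced in this text. A form consistent with the symbols defined in the next sentence is sketched here as an assumption; in particular, the five-tap range of the summation index is assumed rather than taken from the source:

```latex
e_{LPC}(n) = e(n) + \sum_{i=-2}^{2} b_i \, e_{LPC}(n - L - i) \tag{2}
```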

where L is the pitch lag and b_i are the decoded LTP coefficients;

For unvoiced speech, the output signal is simply a copy of the excitation signal, that is, e_LPC(n) = e(n);

Step 5: LPC synthesis. The LPC synthesis filter restores the short-term correlation removed by the LPC analysis filter. The LPC excitation signal e_LPC(n) is filtered with the LPC coefficients a_i, and the decoded signal is obtained according to formula (3):
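Formula (3) is not reproduced in this text. The standard all-pole synthesis recursion, which is consistent with the symbols defined in the next sentence, is sketched here as an assumption:

```latex
y(n) = e_{LPC}(n) + \sum_{i=1}^{d_{LPC}} a_i \, y(n - i) \tag{3}
```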

where d_LPC is the order of the LPC synthesis filter and y(n) is the decoded output signal.
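Decoding Steps 3 to 5 can be sketched as two recursive filters applied in sequence. This is a minimal floating-point sketch under stated assumptions: filter state is not carried across frames, the LTP filter is assumed to have an odd number of taps centred on the pitch lag, and the function names are illustrative.

```python
def ltp_synthesis(excitation, pitch_lag, ltp_coefs):
    """e_LPC(n) = e(n) + sum_i b_i * e_LPC(n - L - i), with taps centred on lag L."""
    half = len(ltp_coefs) // 2
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, b in enumerate(ltp_coefs):
            idx = n - pitch_lag - (k - half)
            if 0 <= idx < len(out):      # only past output samples are available
                acc += b * out[idx]
        out.append(acc)
    return out


def lpc_synthesis(e_lpc, lpc_coefs):
    """y(n) = e_LPC(n) + sum_{i=1..d} a_i * y(n - i)."""
    y = []
    for n, e in enumerate(e_lpc):
        acc = e
        for i, a in enumerate(lpc_coefs, start=1):
            if n - i >= 0:
                acc += a * y[n - i]
        y.append(acc)
    return y


def decode_frame(pulses, gain, voiced, pitch_lag, ltp_coefs, lpc_coefs):
    excitation = [gain * p for p in pulses]                   # Step 3
    if voiced:
        e_lpc = ltp_synthesis(excitation, pitch_lag, ltp_coefs)   # Step 4
    else:
        e_lpc = list(excitation)                              # unvoiced: plain copy
    return lpc_synthesis(e_lpc, lpc_coefs)                    # Step 5


print(decode_frame([2, 0, 0], 0.5, False, 0, [], [0.5]))
```

A real decoder would keep the tail of the previous frame's output as filter history so that the recursion does not restart from zero at every frame boundary.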

The above embodiments should be understood as merely illustrating the present invention and not as limiting its scope of protection. After reading this disclosure, a person skilled in the art may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.

Claims

1. A method for improving the coding efficiency of a SILK-based speech coder, characterized in that it comprises encoding steps at the encoder and decoding steps at the decoder, wherein the steps of the method are specifically:

101. VAD is first performed on the input speech signal to detect pauses, silent intervals, and active speech; meanwhile, the speech signal is passed through a high-pass filter with a cutoff frequency of 70 Hz to remove any DC offset and 50 Hz or 60 Hz hum;

102. Pitch analysis is performed on the speech signal: SILK uses open-loop pitch analysis to make a voiced/unvoiced decision and estimates the pitch period of voiced signals, obtaining the pitch autocorrelation coefficients and the pitch lag;

103. Noise shaping analysis is performed on the high-pass filter output to obtain the gains and filter coefficients used in the prefilter and in the noise shaping quantizer;

104. The signals obtained from pitch analysis and noise shaping analysis are fed to the analog signal generation module; at the same time, LTP analysis is performed on the pitch analysis output, and the output of noise shaping analysis is prefiltered;

105. Further prediction analysis is performed on the speech signal processed by the analog signal generation module and by high-pass filtering; LSF parameters are then extracted and quantized with multi-stage vector quantization to extract the characteristic parameters, and the quantized parameters are converted to LPC coefficients, this conversion keeping the encoder and decoder synchronized;

106. Noise shaping quantization is performed on the basis of step 105; noise shaping makes the noise spectrum follow the spectral changes of the signal, so that the noise is harder to hear;

2. The method for improving the coding efficiency of a SILK-based speech coder according to claim 1, characterized in that, in step 104, the analog signal generation module encodes the speech signal using a time-varying source-filter model, and the coder consists of the following parts:

The input, which consists of a speech signal made up of a series of consecutive frames;

A first signal processing module, which generates an analog signal for each speech frame in the series of consecutive frames by adding a specific noise signal to the input speech frame;

A second signal processing module, which determines the LPC coefficients from the analog signal frame and then determines the LPC residual signal from those LPC coefficients and the input signal;

A third signal processing module, which generates the encoded signal representing the speech signal by arithmetically coding the LPC coefficients and the LPC residual signal;

The analog signal is generated as follows:

A1: The input speech signal is added to the output of the noise shaping filter to form the first input of the analog output signal, where the noise shaping filter consists of a long-term shaping filter and a short-term shaping filter;

A2: White noise and the quantization gain obtained from noise shaping analysis form the second input of the analog output signal, where the white noise has the same variance as the quantization noise;

A3: The two signals obtained in steps A1 and A2 are added to give the final analog output signal, which completes the generation of the analog signal in step 104.

3. The method of implementing the analog signal generation module according to claim 2, characterized in that the coder in step A4 consists of the following parts:

The input, which consists of a speech signal made up of a series of consecutive frames;

A third signal processing module, which generates the encoded signal representing the speech signal by arithmetically coding the LPC coefficients and the LPC residual signal.

4. The method for improving the coding efficiency of a SILK-based speech coder according to claim 1, characterized in that, in step 106, the noise shaping quantizer shapes the spectra of the signal and of the coding noise independently, which yields higher speech quality at the same bit rate; the prefilter output signal is first multiplied by a compensation gain G calculated during NSA, then added to the output of the synthesis shaping filter, and the output of a prediction filter is subtracted, giving a residual signal; the residual signal is multiplied by the quantization gain obtained from NSA, and the result is fed to a lattice quantizer; the quantization indices of the quantizer form the excitation indices that are passed to the range coder; the output of the prediction filter is added to the excitation signal to give the quantized output signal, which in turn serves as the input to the synthesis shaping and prediction filters.