WO2020223797A1

WO2020223797A1 - Methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack

Info

Publication number: WO2020223797A1
Application number: PCT/CA2020/050582
Authority: WO
Inventors: Vaclav Eksler
Original assignee: Voiceage Corporation
Priority date: 2019-05-07
Filing date: 2020-05-01
Publication date: 2020-11-12
Also published as: JP2022532094A; EP3966818A4; CN113826161A; BR112021020507A2; KR20220006510A; CA3136477A1; EP3966818A1; US20220180884A1

Abstract

A method and device for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames. The device comprises a first-stage attack detector for detecting the attack in a last sub-frame of a current frame, and a second-stage attack detector for detecting the attack in one of the sub-frames of the current frame, including the sub-frames preceding the last sub-frame. No attack is detected when the current frame is not an active frame previously classified to be coded using a generic coding mode. A method and device for coding an attack in a sound signal are also provided. The coding device comprises the above mentioned attack detecting device and an encoder of the sub-frame comprising the detected attack using a transition coding mode using a glottal-shape codebook populated with glottal impulse shapes.

Description

METHODS AND DEVICES FOR DETECTING AN ATTACK IN A SOUND SIGNAL TO BE CODED AND FOR CODING THE DETECTED ATTACK

TECHNICAL FIELD

[0001] The present disclosure relates to a technique for coding a sound signal, for example speech or an audio signal, in view of transmitting and synthesizing this sound signal.

[0002] More specifically, but not exclusively, the present disclosure relates to methods and devices for detecting an attack in a sound signal to be coded, for example speech or an audio signal, and for coding the detected attack.

[0003] In the present disclosure and the appended claims:

- the term “attack” refers to a low-to-high energy change of a signal, for example voiced onsets (transitions from an unvoiced speech segment to a voiced speech segment), other sound onsets, transitions, plosives, etc., generally characterized by an abrupt energy increase within a sound signal segment;

- the term “onset” refers to the beginning of a significant sound event, for example speech, a musical note, or other sound;

- the term “plosive” refers, in phonetics, to a consonant in which the vocal tract is blocked so that all airflow ceases; and

- the term “coding of the detected attack” refers to the coding of a sound signal segment whose length is generally a few milliseconds after the beginning of the attack.

BACKGROUND

[0004] A speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. A speech decoder or synthesizer operates on the transmitted or stored digital bit stream and converts it back to a speech signal.

[0005] CELP (Code-Excited Linear Prediction) coding is one of the best techniques for achieving a good compromise between subjective quality and bit rate. This coding technique forms the basis of several speech coding standards both in wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of M samples usually called frames, where M is a predetermined number of speech samples corresponding typically to 10-30 ms. An LP (Linear Prediction) filter is calculated and transmitted every frame. The calculation of the LP filter typically needs a lookahead, for example a 5-15 ms speech segment from the subsequent frame. Each M-sample frame is divided into smaller blocks called sub-frames. Usually the number of sub-frames is two to five, resulting in 4-10 ms sub-frames. In each sub-frame, an excitation is usually obtained from two components, a past excitation contribution and an innovative, fixed codebook excitation contribution. The past excitation contribution is often referred to as the pitch or adaptive codebook excitation contribution. The parameters characterizing the excitation are coded and transmitted to the decoder, where the excitation is reconstructed and supplied as input to an LP synthesis filter.
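As a non-restrictive illustration of the framing described above, the following minimal Python sketch splits a sampled signal into frames and sub-frames. The function name, the 12.8 kHz sampling rate, the 20 ms frame length and the four sub-frames per frame are assumptions chosen for the example and are not taken from the present disclosure.

```python
def split_into_subframes(signal, sample_rate_hz=12800, frame_ms=20,
                         subframes_per_frame=4):
    """Split a sample sequence into frames, each a list of equal-length sub-frames.

    All default values are illustrative assumptions, not values from the disclosure.
    """
    frame_len = sample_rate_hz * frame_ms // 1000      # M samples per frame
    if frame_len % subframes_per_frame != 0:
        raise ValueError("frame length must be a multiple of the sub-frame count")
    sub_len = frame_len // subframes_per_frame         # samples per sub-frame
    frames = []
    # Only complete frames are kept; a trailing partial frame is ignored.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frames.append([frame[i:i + sub_len] for i in range(0, frame_len, sub_len)])
    return frames


frames = split_into_subframes(list(range(600)))
```

With these assumed values, each frame holds 256 samples and each sub-frame 64 samples (5 ms).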

[0006] CELP-based speech codecs rely heavily on prediction to achieve their high performance. Such prediction can be of different types but usually comprises the use of an adaptive codebook storing an adaptive codebook excitation contribution selected from previous frames. A CELP encoder exploits the quasi periodicity of voiced speech by searching in the past adaptive codebook excitation contribution the segment most similar to the segment being currently coded. The same past adaptive codebook excitation contribution is also stored in the decoder. It is then sufficient for the encoder to send a pitch delay and a pitch gain for the decoder to reconstruct the same adaptive codebook excitation contribution as used in the encoder. The evolution (difference) between the previous speech segment and the currently coded speech segment is further modeled using a fixed codebook excitation contribution selected from a fixed codebook.
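The search of the past excitation for the segment most similar to the segment being coded can be sketched as follows. This is a simplified, hypothetical example: it works directly on the excitation rather than in a perceptually weighted domain, it considers integer pitch delays only, and it skips delays shorter than the segment length. The function and parameter names are assumptions.

```python
def search_adaptive_codebook(past_excitation, target, min_lag, max_lag):
    """Return (pitch_delay, pitch_gain) of the past segment closest to target.

    Simplified sketch: integer delays only, delays shorter than the target
    length are skipped, and no perceptual weighting is applied.
    """
    n = len(target)
    best_lag, best_gain, best_score = None, 0.0, -1.0
    for lag in range(max(min_lag, n), max_lag + 1):
        start = len(past_excitation) - lag
        if start < 0:
            break                                   # not enough past samples
        candidate = past_excitation[start:start + n]
        corr = sum(c * t for c, t in zip(candidate, target))
        energy = sum(c * c for c in candidate)
        if energy == 0.0 or corr <= 0.0:
            continue
        score = corr * corr / energy                # normalized correlation criterion
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


# Past excitation with one pulse every 40 samples; the target continues it at half amplitude.
past = [1.0 if i % 40 == 0 else 0.0 for i in range(200)]
target = [0.5 if (200 + i) % 40 == 0 else 0.0 for i in range(32)]
lag, gain = search_adaptive_codebook(past, target, min_lag=32, max_lag=120)
```

In this example only the pitch delay and the pitch gain would need to be sent for the decoder to rebuild the same contribution from its own copy of the past excitation.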

[0007] A problem related to prediction inherent to CELP-based speech codecs appears in the presence of transmission errors (erased frames or packets) when the state of the encoder and the state of the decoder become desynchronized. Due to prediction, the effect of an erased frame is not limited to the erased frame, but continues to propagate after the frame erasure, often during several following frames. Naturally, the perceptual impact can be very annoying. Attacks such as transitions from an unvoiced speech segment to a voiced speech segment (for example transitions between a consonant or a period of inactive speech, and a vowel) or transitions between two different voiced segments (for example transitions between two vowels) are amongst the most problematic cases for frame erasure concealment. When a transition from an unvoiced speech segment to a voiced speech segment (voiced onset) is lost, the frame right before the voiced onset frame is unvoiced or inactive and thus no meaningful excitation contribution is found in the buffer of the adaptive codebook. At the encoder, the past excitation contribution builds up in the adaptive codebook during the voiced onset frame, and the following voiced frame is coded using this past adaptive codebook excitation contribution. Most frame error concealment techniques use the information from the last correctly received frame to conceal the missing frame. When the voiced onset frame is lost, the buffer of the adaptive codebook at the decoder will thus be updated using the noise-like adaptive codebook excitation contribution of the previous frame (unvoiced or inactive frame). The periodic part (adaptive codebook excitation contribution) of the excitation is thus completely missing in the adaptive codebook at the decoder after a lost voiced onset and it can take up to several frames for the decoder to recover from this loss. A similar situation occurs in the case of a lost voiced-to-voiced transition. In that case, the excitation contribution stored in the adaptive codebook before the transition frame has typically very different characteristics from the excitation contribution stored in the adaptive codebook after the transition. Again, as the decoder usually conceals the lost frame with the use of the past frame information, the state of the encoder and the state of the decoder will be very different, and the synthesized signal can suffer from important distortion. A solution to this problem was introduced in Reference [2] where, in a frame following the transition frame, the inter-frame dependent adaptive codebook is replaced by a non-predictive glottal-shape codebook.

[0008] Another issue when coding transition frames in CELP-based codecs is coding efficiency. When a codec processes transitions where the previous and current segment excitations are very different, the coding efficiency decreases. These instances usually occur in frames that encode attacks such as voiced onsets (transitions from an unvoiced speech segment to a voiced speech segment), other sound onsets, transitions between two different voiced segments (for example transitions between two vowels), plosives, etc. The following two issues mostly contribute to such decrease in efficiency (Reference [1]). As a first issue, efficiency of the long-term prediction is poor and, thus, the adaptive codebook excitation contribution to the total excitation is weak. A second issue is related to the gain quantizers, often designed as vector quantizers using a limited bit-budget, which are usually not able to adequately react to an abrupt energy increase within a frame. The closer this abrupt energy increase occurs to the end of a frame, the more critical the second issue is.

[0009] To overcome the above-discussed issues, there is a need for a method and device for improving the coding efficiency of frames including attacks such as onset frames and transition frames and, more generally, to improve coding quality in CELP-based codecs.

SUMMARY

[0010] According to a first aspect, the present disclosure relates to a method for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames. The method comprises a first-stage attack detection for detecting the attack in a last sub-frame of a current frame, and a second-stage attack detection for detecting the attack in one of the sub-frames of the current frame, including the sub-frames preceding the last sub-frame.

[0011] The present disclosure also relates to a method for coding an attack in a sound signal, comprising the above-defined attack detecting method. The coding method comprises encoding the sub-frame comprising the detected attack using a coding mode with a non-predictive codebook.

[0012] According to another aspect, the present disclosure is concerned with a device for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames. The device comprises a first-stage attack detector for detecting the attack in a last sub-frame of a current frame, and a second-stage attack detector for detecting the attack in one of the sub-frames of the current frame, including the sub-frames preceding the last sub-frame.

[0013] The present disclosure is further concerned with a device for coding an attack in a sound signal, comprising the above-defined attack detecting device and an encoder of the sub-frame comprising the detected attack using a coding mode with a non-predictive codebook.

[0014] The foregoing and other objects, advantages and features of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] In the appended drawings:

[0016] Figure 1 is a schematic block diagram of a sound processing and communication system depicting a possible context of implementation of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack;

[0017] Figure 2 is a schematic block diagram illustrating the structure of a CELP-based encoder and decoder, forming part of the sound processing and communication system of Figure 1;

[0018] Figure 3 is a block diagram illustrating concurrently the operations of an EVS (Enhanced Voice Services) coding mode classifying method and the modules of an EVS coding mode classifier;

[0019] Figure 4 is a block diagram illustrating concurrently the operations of a method for detecting an attack in a sound signal to be coded and the modules of an attack detector for implementing the method;

[0020] Figure 5 is a graph of a first non-restrictive, illustrative example showing the impact of the attack detector of Figure 4 and a TC (Transition Coding) coding mode on the quality of a decoded speech signal, wherein curve a) represents an input speech signal, curve b) represents a reference speech signal synthesis, and curve c) represents the improved speech signal synthesis when the attack detector of Figure 4 and the TC coding mode are used for processing an onset frame;

[0021] Figure 6 is a graph of a second non-restrictive, illustrative example showing the impact of the attack detector of Figure 4 and TC coding mode on the quality of a decoded speech signal, wherein curve a) represents an input speech signal, curve b) represents a reference speech signal synthesis, and curve c) represents the improved speech signal synthesis when the attack detector of Figure 4 and the TC coding mode are used for processing an onset frame; and

[0022] Figure 7 is a simplified block diagram of an example configuration of hardware components for implementing the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack.

DETAILED DESCRIPTION

[0023] Although the non-restrictive illustrative embodiments of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack will be described in the following description in connection with a speech signal and a CELP-based codec, it should be kept in mind that these methods and devices are not limited to an application to speech signals and CELP-based codecs but their principles and concepts can be applied to any other types of sound signals and codecs.

[0024] The following description is concerned with detecting an attack in a sound signal, for example speech or an audio signal, and forcing a Transition Coding (TC) mode in sub-frames where an attack is detected. The detection of an attack may also be used for selecting a sub-frame in which a glottal-shape codebook, as part of the TC coding mode, is employed in the place of an adaptive codebook.

[0025] In the EVS codec as described in Reference [4], when a detection algorithm detects an attack in the last sub-frame of a current frame, a glottal-shape codebook of the TC coding mode is used in this last sub-frame. In the present disclosure, the detection algorithm is complemented with a second-stage logic to not only detect a larger number of frames including an attack but also, upon coding of such frames, to force the use of the TC coding mode and corresponding glottal-shape codebook in all sub-frames in which an attack is detected.

[0026] The above technique improves coding efficiency of not only attacks detected in a sound signal to be coded but, also, of certain music segments (e.g. castanets). More generally, coding quality is improved.

[0027] Figure 1 is a schematic block diagram of a sound processing and communication system 100 depicting a possible context of implementation of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack as disclosed in the following description.

[0028] The sound processing and communication system 100 of Figure 1 supports transmission of a sound signal across a communication channel 101. The communication channel 101 may comprise, for example, a wire or an optical fiber link. Alternatively, the communication channel 101 may comprise at least in part a radio frequency link. The radio frequency link often supports multiple, simultaneous communications requiring shared bandwidth resources such as may be found with cellular telephony. Although not shown, the communication channel 101 may be replaced by a storage device in a single device implementation of the system 100 that records and stores the encoded sound signal for later playback.

[0029] Still referring to Figure 1, for example a microphone 102 produces an original analog sound signal 103. As indicated in the foregoing description, the sound signal 103 may comprise, in particular but not exclusively, speech and/or audio.

[0030] The analog sound signal 103 is supplied to an analog-to-digital (A/D) converter 104 for converting it into an original digital sound signal 105. The original digital sound signal 105 may also be recorded and supplied from a storage device (not shown).

[0031] A sound encoder 106 encodes the digital sound signal 105 thereby producing a set of encoding parameters that are multiplexed in the form of a bit stream 107 delivered to an optional error-correcting channel encoder 108. The optional error-correcting channel encoder 108, when present, adds redundancy to the binary representation of the encoding parameters in the bit stream 107 before transmitting the resulting bit stream 111 over the communication channel 101.

[0032] On the receiver side, an optional error-correcting channel decoder 109 utilizes the above-mentioned redundant information in the received digital bit stream 111 to detect and correct errors that may have occurred during transmission over the communication channel 101, producing an error-corrected bit stream 112 with received encoding parameters. A sound decoder 110 converts the received encoding parameters in the bit stream 112 for creating a synthesized digital sound signal 113. The digital sound signal 113 reconstructed in the sound decoder 110 is converted to a synthesized analog sound signal 114 in a digital-to-analog (D/A) converter 115.

[0033] The synthesized analog sound signal 114 is played back in a loudspeaker unit 116 (the loudspeaker unit 116 can obviously be replaced by a headphone). Alternatively, the digital sound signal 113 from the sound decoder 110 may also be supplied to and recorded in a storage device (not shown).

[0034] As a non-limitative example, the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack according to the present disclosure can be implemented in the sound encoder 106 and decoder 110 of Figure 1. It should be noted that the sound processing and communication system 100 of Figure 1, along with the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack, can be extended to cover the case of stereophony where the input of the encoder 106 and the output of the decoder 110 consist of left and right channels of a stereo sound signal. The sound processing and communication system 100 of Figure 1, along with the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack, can be further extended to cover the case of multi-channel and/or scene-based audio and/or independent streams encoding and decoding (e.g. surround and high-order ambisonics).

[0035] Figure 2 is a schematic block diagram illustrating the structure of a CELP-based encoder and decoder which, according to the illustrative embodiments, is part of the sound processing and communication system 100 of Figure 1. As illustrated in Figure 2, a sound codec comprises two basic parts: the sound encoder 106 and the sound decoder 110 both introduced in the foregoing description of Figure 1. The encoder 106 is supplied with the original digital sound signal 105 and determines the encoding parameters 107, described herein below, representing the original analog sound signal 103. These parameters 107 are encoded into the digital bit stream 111. As already explained, the bit stream 111 is transmitted using a communication channel, for example the communication channel 101 of Figure 1, to the decoder 110. The sound decoder 110 reconstructs the synthesized digital sound signal 113 to be as similar as possible to the original digital sound signal 105.

[0036] Presently, the most widespread speech coding techniques are based on Linear Prediction (LP), in particular CELP. In LP-based coding, the synthesized digital sound signal 230 (Figure 2) is produced by filtering an excitation 214 through an LP synthesis filter 216 having a transfer function 1/A(z). An example of procedure to find the filter parameters A(z) of the LP filter can be found in Reference [4].
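The filtering of an excitation through a synthesis filter with transfer function 1/A(z) can be illustrated by the minimal Python sketch below. The sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the function name and the coefficient values are assumptions made for this example; the procedure used to obtain the coefficients is not shown.

```python
def lp_synthesis(excitation, a, memory=None):
    """Filter an excitation through 1/A(z).

    Assumed convention: A(z) = 1 + a[0]*z^-1 + ... + a[p-1]*z^-p, so that
    s(n) = e(n) - sum_i a[i-1] * s(n-i).
    memory holds the p previous output samples, most recent first.
    """
    p = len(a)
    history = list(memory) if memory is not None else [0.0] * p
    out = []
    for e in excitation:
        s = e - sum(a[i] * history[i] for i in range(p))
        out.append(s)
        history.insert(0, s)     # newest output sample goes first
        history.pop()            # drop the oldest sample to keep p values
    return out


# Impulse response of a first-order example filter 1 / (1 - 0.5 z^-1).
response = lp_synthesis([1.0, 0.0, 0.0, 0.0], [-0.5])
```

For this first-order example the impulse response decays as 1, 0.5, 0.25, 0.125.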

[0037] In CELP, the excitation 214 is typically composed of two parts: a first-stage, adaptive-codebook contribution 222 produced by selecting a past excitation signal v(n) from an adaptive codebook 218 in response to an index t (pitch lag) and by amplifying the past excitation signal v(n) by an adaptive-codebook gain g_p 226, and a second-stage, fixed-codebook contribution 224 produced by selecting an innovative codevector C_k(n) from a fixed codebook 220 in response to an index k and by amplifying the innovative codevector C_k(n) by a fixed-codebook gain g_c 228. Generally speaking, the adaptive codebook contribution 222 models the periodic part of the excitation and the fixed codebook excitation contribution 224 is added to model the evolution of the sound signal.

[0038] The sound signal is processed by frames of typically 20 ms and the filter parameters A(z) of the LP filter are transmitted from the encoder 106 to the decoder 110 once per frame. In CELP, the frame is further divided in several sub-frames to encode the excitation. The sub-frame length is typically 5 ms.
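The two-part construction of the excitation described above can be sketched as follows. This is a hypothetical illustration with integer pitch lags only; when the pitch lag is shorter than the sub-frame, the adaptive-codebook vector is extended periodically, which is a simplification chosen for the example. All names are assumptions.

```python
def build_excitation(adaptive_buffer, pitch_lag, g_p, codevector, g_c):
    """Return (excitation, updated_buffer) for one sub-frame.

    excitation(n) = g_p * v(n) + g_c * c_k(n), where v(n) is read from the past
    excitation at the given integer pitch lag (periodic extension if the lag is
    shorter than the sub-frame).
    """
    n_samples = len(codevector)
    size = len(adaptive_buffer)
    if not 1 <= pitch_lag <= size:
        raise ValueError("pitch lag must lie within the adaptive codebook buffer")
    v = []
    for n in range(n_samples):
        if n < pitch_lag:
            v.append(adaptive_buffer[size - pitch_lag + n])
        else:
            v.append(v[n - pitch_lag])          # periodic extension
    excitation = [g_p * v[n] + g_c * codevector[n] for n in range(n_samples)]
    # The new excitation becomes part of the past excitation for the next sub-frame.
    updated = (list(adaptive_buffer) + excitation)[-size:]
    return excitation, updated


exc, buf = build_excitation([0.0, 0.0, 1.0, 2.0], pitch_lag=2, g_p=0.5,
                            codevector=[1.0, 0.0, 0.0], g_c=2.0)
```

The buffer update in the last step is what makes the adaptive codebook depend on previous sub-frames and frames.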

[0039] CELP uses a principle called Analysis-by-Synthesis where possible decoder outputs are tried (synthesized) already during the coding process at the encoder 106 and then compared to the original digital sound signal 105. The encoder 106 thus includes elements similar to those of the decoder 110. These elements include an adaptive codebook excitation contribution 250 (corresponding to the adaptive-codebook contribution 222 at the decoder 110) selected in response to the index t (pitch lag) from an adaptive codebook 242 (corresponding to the adaptive codebook 218 at the decoder 110) that supplies a past excitation signal v(n) convolved with the impulse response of a weighted synthesis filter H(z) 238 (cascade of the LP synthesis filter 1/A(z) and a perceptual weighting filter W(z)), the output y₁(n) of which is amplified by an adaptive-codebook gain g_p 240 (corresponding to the adaptive-codebook gain 226 at the decoder 110). These elements also include a fixed codebook excitation contribution 252 (corresponding to the fixed-codebook contribution 224 at the decoder 110) selected in response to the index k from a fixed codebook 244 (corresponding to the fixed codebook 220 at the decoder 110) that supplies an innovative codevector C_k(n) convolved with the impulse response of the weighted synthesis filter H(z) 246, the output y₂(n) of which is amplified by a fixed codebook gain g_c 248 (corresponding to the fixed-codebook gain 228 at the decoder 110).

[0040] The encoder 106 comprises the perceptual weighting filter W(z) 233 and a calculator 234 of a zero-input response of the cascade (H(z)) of the LP synthesis filter 1/A(z) and the perceptual weighting filter W(z). Subtractors 236, 254 and 256 respectively subtract the zero-input response from calculator 234, the adaptive codebook contribution 250 and the fixed codebook contribution 252 from the original digital sound signal 105 filtered by the perceptual weighting filter 233 to provide an error signal used to calculate a mean-squared error 232 between the original digital sound signal 105 and the synthesized digital sound signal 113 (Figure 1).

[0041] The adaptive codebook 242 and the fixed codebook 244 are searched to minimize the mean-squared error 232 between the original digital sound signal 105 and the synthesized digital sound signal 113 in a perceptually weighted domain, where discrete time index n = 0, 1, ..., N-1, and N is the length of the sub-frame. Minimization of the mean-squared error 232 provides the best candidate past excitation signal v(n) (identified by the index t) and innovative codevector c_k(n) (identified by the index k) for coding the digital sound signal 105. The perceptual weighting filter W(z) exploits the frequency masking effect and typically is derived from the LP filter A(z). An example of perceptual weighting filter W(z) for WB (wideband, bandwidth of typically 50 - 7000 Hz) signals can be found in Reference [4].
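The minimization of the mean-squared error over candidate codevectors can be illustrated by the exhaustive-search sketch below. It is a hypothetical simplification: the target is assumed to be already expressed in the weighted domain, the codebook is a short explicit list rather than an algebraic codebook, and the gain is left unquantized. The function names are assumptions.

```python
def convolve_truncated(x, h):
    """Convolve x with impulse response h, truncated to len(x) samples."""
    return [sum(x[j] * h[i - j] for j in range(i + 1) if i - j < len(h))
            for i in range(len(x))]


def search_fixed_codebook(target, codebook, h):
    """Return (index k, gain, squared error) of the codevector that, once filtered
    through h and scaled by its optimal gain, is closest to the target."""
    best_k, best_gain, best_err = None, 0.0, float("inf")
    for k, codevector in enumerate(codebook):
        y = convolve_truncated(codevector, h)       # filtered codevector
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # optimal gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_k, best_gain, best_err = k, gain, err
    return best_k, best_gain, best_err


h = [1.0, 0.5]                                      # assumed impulse response
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
target = [0.0, 0.0, 2.0, 1.0]
k, gain, err = search_fixed_codebook(target, codebook, h)
```

Practical codecs avoid this brute-force loop by using structured codebooks and fast search procedures; the sketch only shows the criterion being minimized.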

[0042] Since the memory of the LP synthesis filter 1/A(z) and the weighting filter W(z) is independent from the searched innovative codevector c_k(n), this memory (zero-input response of the cascade (H(z)) of the LP synthesis filter 1/A(z) and the perceptual weighting filter W(z)) can be subtracted (subtractor 236) from the original digital sound signal 105 prior to the fixed codebook search. Filtering of the candidate innovative codevector c_k(n) can then be done by means of a convolution with the impulse response of the cascade of the filters 1/A(z) and W(z), represented by H(z) in Figure 2.

[0043] The digital bit stream 111 transmitted from the encoder 106 to the decoder 110 contains typically the following parameters 107: quantized parameters of the LP filter A(z), index t of the adaptive codebook 242 and index k of the fixed codebook 244, and the gains g_p 240 and g_c 248 of the adaptive codebook 242 and of the fixed codebook 244. In the decoder 110:

- the received quantized parameters of the LP filter A(z) are used to build the LP synthesis filter 216;

- the received index t is applied to the adaptive codebook 218;

- the received index k is applied to the fixed codebook 220;

- the received gain g_p is used as adaptive-codebook gain 226; and

- the received gain g_c is used as fixed-codebook gain 228.

[0044] Further explanations on the structure and operation of CELP-based encoder and decoder can be found, for example, in Reference [4].

[0045] Also, although the following description makes reference to the EVS Standard (Reference [4]), it should be kept in mind that the concepts, principles, structures and operations as described therein may be applied to other sound/speech processing and communication Standards.

Coding of Voiced Onsets

[0046] To obtain better coding performance, the LP-based core of the EVS codec as described in Reference [4] uses a signal classification algorithm and six (6) distinct coding modes tailored for each category of signal, namely the Inactive Coding (IC) mode, Unvoiced Coding (UC) mode, Transition Coding (TC) mode, Voiced Coding (VC) mode, Generic Coding (GC) mode, and Audio Coding (AC) mode (not shown).

[0047] Figure 3 is a simplified high-level block diagram illustrating concurrently the operations of an EVS coding mode classifying method 300 and the modules of an EVS coding mode classifier 320.

[0048] Referring to Figure 3, the coding mode classifying method 300 comprises an active frame detection operation 301, an unvoiced frame detection operation 302, a frame after onset detection operation 303 and a stable voiced frame detection operation 304.

[0049] To perform the active frame detection operation 301, an active frame detector 311 determines whether the current frame is active or inactive. For that purpose, sound activity detection (SAD) or voice activity detection (VAD) can be used. If an inactive frame is detected, the IC coding mode 321 is selected and the procedure is terminated.

[0050] If the detector 311 detects an active frame during the active frame detection operation 301, the unvoiced frame detection operation 302 is performed using an unvoiced frame detector 312. Specifically, if an unvoiced frame is detected, the unvoiced frame detector 312 selects, to code the detected unvoiced frame, the UC coding mode 322. The UC coding mode is designed to code unvoiced frames. In the UC coding mode, the adaptive codebook is not used and the excitation is composed of two vectors selected from a linear Gaussian codebook. Alternatively, the coding mode in UC may be composed of a fixed algebraic codebook and a Gaussian codebook.

[0051] If the current frame is not classified as unvoiced by the detector 312, the frame after onset detection operation 303 and a corresponding frame after onset detector 313, and the stable voiced frame detection operation 304 and a corresponding stable voiced frame detector 314 are used.

[0052] In the frame after onset detection operation 303, the detector 313 detects voiced frames following voiced onsets and selects the TC coding mode 323 to code these frames. The TC coding mode 323 is designed to enhance the codec performance in the presence of frame erasures by limiting the usage of past information (adaptive codebook). To minimize at the same time the impact of the TC coding mode 323 on a clean channel performance (without frame erasures), mode 323 is used only on the most critical frames from a frame erasure point of view. These most critical frames are voiced frames following voiced onsets.

[0053] If the current frame is not a voiced frame following a voiced onset, the stable voiced frame detection operation 304 is performed. During this operation, the stable voiced frame detector 314 is designed to detect quasi-periodic stable voiced frames. If the current frame is detected as a quasi-periodic stable voiced frame, the detector 314 selects the VC coding mode 324 to encode the stable voiced frame. The selection of the VC coding mode by the detector 314 is conditioned by a smooth pitch evolution. The VC coding mode uses Algebraic Code-Excited Linear Prediction (ACELP) technology but, given that the pitch evolution is smooth throughout the frame, more bits are assigned to the fixed (algebraic) codebook than in the GC coding mode.

[0054] If the current frame is not classified into one of the above frame categories during the operations 301-304, this frame is likely to contain a nonstationary speech segment and the detector 314 selects, for encoding such frame, the GC coding mode 325, for example a generic ACELP coding mode.

[0055] Finally, a speech/music classification algorithm (not shown) of the EVS Standard is run to decide whether the current frame shall be coded using the AC mode. The AC mode has been designed to efficiently code generic audio signals, in particular but not exclusively music.
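The sequence of decisions of operations 301-304 can be summarized by the following minimal Python sketch. The boolean inputs stand for the outcomes of the detectors 311-314; how each detector reaches its decision is not modeled, and the function and parameter names are assumptions. The AC mode is left out because it is decided by the separate speech/music classification.

```python
from enum import Enum


class CodingMode(Enum):
    IC = "Inactive Coding"
    UC = "Unvoiced Coding"
    TC = "Transition Coding"
    VC = "Voiced Coding"
    GC = "Generic Coding"


def select_coding_mode(is_active, is_unvoiced, follows_voiced_onset,
                       is_stable_voiced):
    """Mirror the order of operations 301 to 304; inputs are detector outcomes."""
    if not is_active:                 # operation 301
        return CodingMode.IC
    if is_unvoiced:                   # operation 302
        return CodingMode.UC
    if follows_voiced_onset:          # operation 303
        return CodingMode.TC
    if is_stable_voiced:              # operation 304
        return CodingMode.VC
    return CodingMode.GC              # none of the above categories applies


mode = select_coding_mode(is_active=True, is_unvoiced=False,
                          follows_voiced_onset=False, is_stable_voiced=False)
```

In this example an active frame that is neither unvoiced, nor a frame after a voiced onset, nor stable voiced falls through to the GC coding mode.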

[0056] In order to improve the codec's performance for noisy channels, a refinement of the coding mode classification method described in the previous paragraphs with reference to Figure 3, called frame classification for Frame Error Concealment (FEC), is applied (Reference [4]). The basic idea behind using a different frame classification approach for FEC is the fact that an ideal strategy for FEC should be different for quasi-stationary speech segments and for speech segments with rapidly changing characteristics. In the EVS Standard (Reference [4]), the frame classification for FEC used at the encoder defines five (5) distinct classes as follows. UNVOICED class comprises all unvoiced speech frames and all frames without active speech. A voiced offset frame can also be classified as UNVOICED class if its end tends to be unvoiced. UNVOICED TRANSITION class comprises unvoiced frames with a possible voiced onset at the end of the frame. VOICED TRANSITION class comprises voiced frames with relatively weak voiced characteristics. VOICED class comprises voiced frames with stable characteristics. ONSET class comprises all voiced frames with stable characteristics following a frame classified as UNVOICED class or UNVOICED TRANSITION class.

[0057] Further explanations on the EVS coding mode classifying method 300 and the EVS coding mode classifier 320 of Figure 3 can be found, for example, in Reference [4].

[0058] Originally, the TC coding mode was introduced to be used in frames following a transition for helping to stop error propagation in case a transition frame is lost (Reference [4]). In addition, the TC coding mode can be used in transition frames to increase coding efficiency. In particular, just before a voiced onset, the adaptive codebook usually contains a noise-like signal not very useful or efficient for coding the beginning of a voiced segment. The goal is to supplement the adaptive codebook with a better, non-predictive codebook populated with simplified quantized versions of glottal impulse shapes to encode the voiced onsets. The glottal-shape codebook is used only in one sub-frame containing the first glottal impulse within the frame, more precisely in the sub-frame where the LP residual signal (s_w(n) in Figure 2) has its maximum energy within the first pitch period of the frame. Further explanations on the TC coding mode of Figure 3 can be found, for example, in Reference [4].

[0059] The present disclosure proposes to further extend the EVS concept of coding voiced onsets using the glottal-shape codebook of the TC coding mode. When an attack occurs towards the end of a frame, it is proposed to direct as much as possible of the bit-budget (number of available bits) for coding the excitation toward the end of the frame, since coding of the preceding part of the frame (sub-frames before the sub-frame including the attack) with a low number of bits is sufficient. A difference with the TC coding mode of EVS as described in Reference [4] is that the glottal-shape codebook is usually used in the last sub-frame(s) within the frame, independently of the real maximum energy of the LP residual signal within the first pitch period of the frame.

[0060] By forcing most of the bit-budget toward encoding the end of the frame, the waveform of the sound signal at the beginning of the frame might not be well modeled, especially at low bit-rates where the fixed codebook is formed of, for example, only one or two pulses per sub-frame. However, the sensitivity of the human ear is exploited here. The human ear is not very sensitive to an inaccurate coding of a sound signal before an attack, but is much more sensitive to any imperfection in coding a sound signal segment, for example a voiced segment, after such an attack. By forcing a larger number of bits to construct an attack, the adaptive codebook in subsequent sound signal frames is more efficient because it benefits from the well-modeled past excitation corresponding to the attack segment. The subjective quality is consequently improved.

[0061] The present disclosure proposes a method for detecting an attack and a corresponding attack detector which operates on frames to be coded with the GC coding mode to determine if these frames should be encoded with the TC coding mode. Specifically, when an attack is detected, these frames are coded using the TC coding mode. Thus, the relative number of frames coded using the TC coding mode increases. Moreover, as the TC coding mode does not use the past excitation, the intrinsic robustness of the codec against frame erasures is increased with this approach.

Attack Detecting Method and Attack Detector

[0062] Figure 4 is a block diagram illustrating concurrently the operations of an attack detecting method 400 and the modules of an attack detector 450.

[0063] The attack detecting method 400 and attack detector 450 properly select frames to be coded using the TC coding mode. The following description describes, in connection with Figure 4, an example of attack detecting method 400 and attack detector 450 that can be used in a codec, in this illustrative example a CELP codec with an internal sampling rate of 12.8 kHz and with a frame having a length of 20 ms and composed of four (4) sub-frames. An example of such codec is the EVS codec (Reference [4]) at lower bit-rates (< 13.2 kbps). An application to other types of codecs, with different internal bit-rates, frame lengths and numbers of sub-frames can also be contemplated.

[0064] The detection of attacks starts with a preprocessing where energies in several segments of the input sound signal in the current frame are calculated, followed by a detection performed sequentially in two stages and by a final decision. The first-stage detection is based on comparing calculated energies in the current frame while the second-stage detection takes into account also past frame energy values.

Energies of segments

[0065] In an energy calculating operation 401 of Figure 4, an energy calculator 451 calculates the energy in a plurality of successive analysis segments of the perceptually weighted input sound signal s_w(n), where n = 0, 1, ..., N-1 and where N is the length of the frame in samples. To calculate such energy, the calculator 451 may use, for example, the following Equation (1):

$$E_{seg}(i) = \sum_{n=0}^{K-1} s_w^2(iK + n), \quad i = 0, \ldots, N/K - 1 \qquad (1)$$

where K is the length in samples of the analysis sound signal segment, i is the index of the segment, and N/K is the total number of segments. In the EVS Standard operating at an internal sampling rate of 12.8 kHz, the length of the frame is N = 256 samples and the length of the segment can be set to, for example, K = 8, which results in a total number of N/K = 32 analysis segments. Thus, segments i = 0, ..., 7 correspond to the first sub-frame, segments i = 8, ..., 15 to the second sub-frame, segments i = 16, ..., 23 to the third sub-frame, and finally segments i = 24, ..., 31 to the last (fourth) sub-frame of the current frame. In the non-limitative illustrative example of Equation (1), the segments are consecutive. In another possible embodiment, partially overlapping segments can be employed.
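As a non-limitative illustration, the calculation of the energies of consecutive analysis segments described above may be sketched in Python as follows. The function name `segment_energies` and the representation of the weighted signal as a plain list of samples are assumptions made for this sketch only; they do not come from the EVS Standard.

```python
def segment_energies(s_w, seg_len=8):
    """Return the energy of each consecutive analysis segment of one frame.

    s_w     : samples of the perceptually weighted signal for one frame
    seg_len : segment length K in samples (8 in the example of the text)
    """
    if len(s_w) % seg_len != 0:
        raise ValueError("frame length must be a multiple of the segment length")
    return [
        sum(x * x for x in s_w[i * seg_len:(i + 1) * seg_len])
        for i in range(len(s_w) // seg_len)
    ]


# Hypothetical frame of N = 256 samples with one non-zero sample at n = 200,
# which falls in segment 200 // 8 = 25, i.e. in the fourth sub-frame.
frame = [0.0] * 256
frame[200] = 1.0
e_seg = segment_energies(frame)
```

With N = 256 and K = 8 the sketch yields 32 segment energies, of which segments 24 to 31 belong to the last sub-frame.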

[0066] Next, in a maximum energy segment finding operation 402, a maximum energy segment finder 452 finds the segment i with maximum energy. For that purpose, the finder 452 may use, for example, the following Equation (2):

$$I_{att} = \arg\max_{i = 0, \ldots, N/K - 1} E_{seg}(i) \qquad (2)$$

[0067] The segment with maximum energy represents the position of a candidate attack, which is validated in the following two stages (hereinafter first-stage and second-stage).
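The search for the candidate attack position may, as an example, be sketched as below. The function name `find_candidate_attack` is hypothetical, and resolving ties in favour of the earliest segment is an assumption of this sketch; the text does not specify tie handling.

```python
def find_candidate_attack(e_seg):
    """Return the index of the analysis segment with maximum energy.

    Ties are resolved in favour of the earliest segment (an assumption of
    this sketch), because max() keeps the first maximal item it encounters.
    """
    return max(range(len(e_seg)), key=lambda i: e_seg[i])


# Hypothetical segment energies: a burst in segment 26 (last sub-frame).
example = [1.0] * 32
example[26] = 50.0
candidate = find_candidate_attack(example)
```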

[0068] In the illustrative embodiments, given as example in the present description, only active frames (VAD = 1, where local VAD is considered in the current frame) previously classified for being processed using the GC coding mode are subject to the following first-stage and second-stage attack detection. Further explanations on VAD (Voice Activity Detection) can be found, for example, in Reference [4]. In a decision operation 403, a decision module 453 determines if VAD = 1 and the current frame has been classified for being processed using the GC coding mode. If yes, the first-stage attack detection is performed on the current frame. Otherwise, no attack is detected and the current frame is processed according to its previous classification as shown in Figure 3.

[0069] Both speech and music frames can be classified in the GC coding mode and, therefore, attack detection is applied in coding not only speech signals but general sound signals.

First-Stage Attack Detection

[0070] The first-stage attack detection operation 404 and the corresponding first-stage attack detector 454 will now be described with reference to Figure 4.

[0071] The first-stage attack detection operation 404 comprises an average energy calculating operation 405. To perform operation 405, the first-stage attack detector 454 comprises a calculator 455 of an average energy across the analysis segments before the last sub-frame in the current frame using, for example, the following Equation (3):

$$E_1 = \frac{1}{P} \sum_{i=0}^{P-1} E_{seg}(i) \qquad (3)$$

where P is the number of segments before the last sub-frame. In the non-limitative example implementation, where N/K = 32, parameter P is equal to 24.

[0072] Similarly, in average energy calculating operation 405, the calculator 455 calculates an average energy across the analysis segments starting with segment I_att to the last segment of the current frame, using as an example the following Equation (4):

$$E_2 = \frac{1}{N/K - I_{att}} \sum_{i=I_{att}}^{N/K-1} E_{seg}(i) \qquad (4)$$

[0073] The first-stage attack detection operation 404 further comprises a comparison operation 406. To perform the comparison operation 406, the first-stage attack detector 454 comprises a comparator 456 for comparing the ratio of the average energy E₂ from Equation (4) to the average energy E₁ from Equation (3) with a threshold depending on the signal classification of the previous frame, denoted as "last_class", performed by the above discussed frame classification for Frame Error Concealment (FEC) (Reference [4]). The comparator 456 determines an attack position from the first-stage attack detection, I_att1, using as a non-limitative example the following logic of Equation (5):

$$I_{att1} = \begin{cases} 0 & \text{if } E_2 / E_1 < \beta_1 \\ 0 & \text{if } E_2 / E_1 < \beta_2 \text{ and last\_class} = \text{VOICED} \\ I_{att} & \text{otherwise} \end{cases} \qquad (5)$$

where β₁ and β₂ are thresholds that can be set, according to the non-limitative example, to β₁ = 8 and β₂ = 20, respectively. When I_att1 = 0, no attack is detected. Using the logic of Equation (5), all attacks that are not sufficiently strong are eliminated.
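The threshold test of the first stage may be illustrated by the following sketch. It assumes that a candidate is kept only when the average energy from the candidate segment onward is at least the threshold times the average energy before the last sub-frame, and that the stricter threshold of 20 applies when the previous frame class is VOICED; the function and parameter names are hypothetical.

```python
BETA_1 = 8.0   # threshold used when the previous frame is not VOICED
BETA_2 = 20.0  # stricter threshold used when the previous frame is VOICED


def first_stage_ratio_test(e_seg, i_att, last_class_voiced, p=24):
    """Return the first-stage attack position, or 0 when no attack is kept.

    e_seg : segment energies of the current frame
    i_att : candidate attack position (segment with maximum energy)
    p     : number of segments before the last sub-frame
    """
    n_seg = len(e_seg)
    e1 = sum(e_seg[:p]) / p                     # average before the last sub-frame
    e2 = sum(e_seg[i_att:]) / (n_seg - i_att)   # average from the candidate onward
    threshold = BETA_2 if last_class_voiced else BETA_1
    # Cross-multiplied form of e2 / e1 < threshold, avoiding a division by zero.
    if e2 < threshold * e1:
        return 0
    return i_att


# Hypothetical frame: quiet first three sub-frames, ten times louder last one.
energies = [1.0] * 24 + [10.0] * 8
after_unvoiced = first_stage_ratio_test(energies, 24, last_class_voiced=False)
after_voiced = first_stage_ratio_test(energies, 24, last_class_voiced=True)
```

In this example the ratio of 10 passes the threshold of 8 but not the threshold of 20, so the same frame is retained after a non-VOICED frame and rejected after a VOICED frame.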

[0074] In order to further reduce the number of falsely detected attacks, the first-stage attack detection operation 404 further comprises a segment energy comparison operation 407. To perform the segment energy comparison operation 407, the first-stage attack detector 454 comprises a segment energy comparator 457 for comparing the energy E_seg(I_att) of the segment with maximum energy with the energy E_seg(i) of the other analysis segments of the current frame. Thus, if I_att1 > 0 as determined by the operation 406 and comparator 456, the comparator 457 performs, as a non-limitative example, the comparison of Equation (6) for i = 2, ..., P-3:

$$I_{att1} = \begin{cases} 0 & \text{if } E_{seg}(I_{att}) / E_{seg}(i) < \beta_3 \text{ for any } i \\ I_{att} & \text{otherwise} \end{cases} \qquad (6)$$

where threshold β₃ is determined experimentally so as to reduce as much as possible falsely detected attacks without impeding the efficiency of detection of true attacks. In a non-limitative experimental implementation, the threshold β₃ is set to 2. Again, when I_att1 = 0, no attack is detected.
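A possible form of the segment energy comparison is sketched below. It assumes that the candidate is rejected as soon as any segment i = 2, ..., P-3 has more than 1/β₃ of the energy of the maximum-energy segment; this reading of the comparison and the name `first_stage_segment_test` are assumptions of the sketch.

```python
BETA_3 = 2.0  # experimentally chosen threshold of the text


def first_stage_segment_test(e_seg, i_att1, p=24):
    """Return i_att1 if the candidate survives the comparison, else 0.

    e_seg  : segment energies of the current frame
    i_att1 : attack position kept by the preceding threshold test (0 = none)
    p      : number of segments before the last sub-frame
    """
    if i_att1 == 0:
        return 0
    peak = e_seg[i_att1]
    for i in range(2, p - 2):  # i = 2, ..., P-3
        # Cross-multiplied form of peak / e_seg[i] < BETA_3.
        if e_seg[i] * BETA_3 > peak:
            return 0
    return i_att1


# Hypothetical frames: a clean onset, and the same onset preceded by an
# earlier segment that already carries more than half of the peak energy.
clean = [1.0] * 24 + [100.0] * 8
busy = list(clean)
busy[5] = 60.0
kept = first_stage_segment_test(clean, 24)
rejected = first_stage_segment_test(busy, 24)
```

Under this reading, a candidate located in one of the compared segments rejects itself whenever its energy is non-zero, which is consistent with the first stage only validating attacks located towards the end of the frame.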

Second-Stage Attack Detection

[0075] The second-stage attack detection operation 410 and the corresponding second-stage attack detector 460 will now be described with reference to Figure 4.

[0076] The second-stage attack detection operation 410 comprises a voiced class comparison operation 411. To perform the voiced class comparison operation 411, the second-stage attack detector 460 comprises a voiced class decision module 461 to get information from the above discussed EVS FEC classifying method to determine whether the current frame class is VOICED or not. If the current frame class is VOICED, the decision module 461 outputs the decision that no attack is detected.

[0077] If an attack was not detected in the first-stage attack detection operation 404 and first-stage attack detector 454 (specifically the comparison operation 406 and comparator 456 or the comparison operation 407 and comparator 457), i.e. I_att1 = 0, and the class of the current frame is other than VOICED, then the second-stage attack detection operation 410 and the second-stage attack detector 460 are applied.

[0078] The second-stage attack detection operation 410 comprises a mean energy calculating operation 412. To perform operation 412, the second-stage attack detector 460 comprises a mean energy calculator 462 for calculating a mean energy E₃ across the N/K analysis segments before the candidate attack I_att, including segments from the previous frame, using for example Equation (7):

$$E_3 = \frac{1}{N/K} \left[ \sum_{i=I_{att}}^{N/K-1} E_{seg,past}(i) + \sum_{i=0}^{I_{att}-1} E_{seg}(i) \right] \qquad (7)$$

where E_seg,past(i) are the energies per segment from the previous frame.

[0079] The second-stage attack detection operation 410 comprises a logic decision operation 413. To perform operation 413, the second-stage attack detector 460 comprises a logic decision module 463 to find an attack position from the second-stage attack detector, I_att2, by applying, for example, the following logic of Equation (8) to the mean energy E₃ from Equation (7):

$$I_{att2} = \begin{cases} I_{att} & \text{if } E_{seg}(I_{att}) / E_3 > \beta_4 \\ I_{att} & \text{if } E_{seg}(I_{att}) / E_3 > \beta_5 \text{ and last\_class} = \text{UNVOICED} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

where I_att was found in Equation (2) and β₄ and β₅ are thresholds being set, in this non-limitative example implementation, to β₄ = 16 and β₅ = 12, respectively. When the logic decision operation 413 and logic decision module 463 determine that I_att2 = 0, no attack is detected.
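The mean energy calculation and the logic decision of the second stage may be sketched as follows. The sketch assumes that the mean is taken over the N/K segments immediately preceding the candidate (the tail of the previous frame followed by the head of the current frame) and that the lower threshold of 12 applies when the previous frame class is UNVOICED; all names are hypothetical.

```python
BETA_4 = 16.0  # default threshold
BETA_5 = 12.0  # lower threshold used when the previous frame is UNVOICED


def second_stage_candidate(e_seg, e_seg_past, i_att, last_class_unvoiced):
    """Return the second-stage attack position, or 0 when none is found.

    e_seg      : list of segment energies of the current frame
    e_seg_past : list of segment energies of the previous frame (same length)
    i_att      : candidate attack position in the current frame
    """
    n_seg = len(e_seg)
    # The N/K segments just before the candidate: the last (n_seg - i_att)
    # segments of the previous frame, then the first i_att of the current one.
    window = e_seg_past[i_att:] + e_seg[:i_att]
    mean_before = sum(window) / n_seg
    threshold = BETA_5 if last_class_unvoiced else BETA_4
    # Cross-multiplied form of e_seg[i_att] / mean_before > threshold.
    if e_seg[i_att] > threshold * mean_before:
        return i_att
    return 0


# Hypothetical frames: flat background, one segment 14 times stronger.
past = [1.0] * 32
current = [1.0] * 10 + [14.0] + [1.0] * 21
default_case = second_stage_candidate(current, past, 10, last_class_unvoiced=False)
unvoiced_case = second_stage_candidate(current, past, 10, last_class_unvoiced=True)
```

Here the ratio of 14 lies between the two thresholds, so the attack is found only when the previous frame was UNVOICED.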

[0080] The second-stage attack detection operation 410 finally comprises an energy comparison operation 414. To perform operation 414, the second-stage attack detector 460 comprises an energy comparator 464 to compare, in order to further reduce the number of falsely detected attacks when I_att2 as determined in the logic decision operation 413 and logic decision module 463 is larger than 0, the following ratio with the following threshold as shown, for example, in Equation (9):

$$\frac{E_{seg}(I_{att})}{E_{LT}} < \beta_6 \qquad (9)$$

where β₆ is a threshold set to β₆ = 20 in this non-limitative example implementation, and E_LT is a long-term energy computed using, as a non-limitative example, Equation (10):

$$E_{LT} = \alpha E_{LT} + (1 - \alpha) \frac{1}{N/K} \sum_{i=0}^{N/K-1} E_{seg}(i) \qquad (10)$$

In this non-limitative example implementation, the parameter α is set to 0.95. Again, when I_att2 = 0, no attack is detected.

[0081] Finally, in the energy comparison operation 414, the energy comparator 464 sets the attack position I_att2 to 0 if an attack was detected in the previous frame. In this case no attack is detected.
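The validation performed in the energy comparison operation 414 may be illustrated as below. Two points are assumptions of this sketch and are not stated in the text: the candidate is discarded when its energy is less than the threshold times the long-term energy, and the long-term energy is updated by first-order recursive smoothing of the mean segment energy of each frame with smoothing factor 0.95. The function names are hypothetical.

```python
BETA_6 = 20.0  # threshold on the ratio to the long-term energy
ALPHA = 0.95   # smoothing factor of the long-term energy


def second_stage_validate(i_att2, e_seg, e_lt, attack_in_prev_frame):
    """Return i_att2 if the second-stage attack is confirmed, else 0."""
    if i_att2 == 0:
        return 0
    # Assumed rejection rule: peak / e_lt < BETA_6, in cross-multiplied form.
    if attack_in_prev_frame or e_seg[i_att2] < BETA_6 * e_lt:
        return 0
    return i_att2


def update_long_term_energy(e_lt, e_seg):
    """Assumed update: recursive average of the frame's mean segment energy."""
    return ALPHA * e_lt + (1.0 - ALPHA) * sum(e_seg) / len(e_seg)


# Hypothetical frame with a strong segment at position 10.
frame_energies = [1.0] * 10 + [100.0] + [1.0] * 21
confirmed = second_stage_validate(10, frame_energies, e_lt=1.0, attack_in_prev_frame=False)
too_weak = second_stage_validate(10, frame_energies, e_lt=10.0, attack_in_prev_frame=False)
repeated = second_stage_validate(10, frame_energies, e_lt=1.0, attack_in_prev_frame=True)
```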

Final Attack Detection Decision

[0082] A final decision whether the current frame is determined as an attack frame to be coded using the TC coding mode is conducted based on the positions of the attacks I_att1 and I_att2 obtained during the first-stage 404 and second-stage 410 detection operations, respectively.

[0083] If the current frame is active (VAD = 1) and previously classified for coding in the GC coding mode as determined in the decision operation 403 and decision module 453, the following logic of, for example, Equation (11) is applied:

$$I_{att,final} = \begin{cases} I_{att1} & \text{if } I_{att1} \ge P \\ I_{att2} & \text{otherwise} \end{cases} \qquad (11)$$

[0084] Specifically, the attack detecting method 400 comprises a first-stage attack decision operation 430. To perform operation 430, if the current frame is active (VAD = 1) and previously classified for coding in the GC coding mode as determined in the decision operation 403 and decision module 453, the attack detector 450 further comprises a first-stage attack decision module 470 to determine if I_att1 ≥ P. If I_att1 ≥ P, then I_att1 is the position of the detected attack, I_att,final, in the last sub-frame of the current frame and is used to determine that the glottal-shape codebook of the TC coding mode is used in this last sub-frame. Otherwise, no attack is detected.

[0085] Regarding the second-stage attack detection, if the comparison of Equation (9) is true or if an attack was detected in the previous frame as determined in energy comparison operation 414 and energy comparator 464, then I_att2 = 0 and no attack is detected. Otherwise, in an attack decision operation 440 of the attack detecting method 400, an attack decision module 480 of the attack detector 450 determines that an attack is detected in the current frame at position I_att,final = I_att2. The position of the detected attack, I_att,final, is used to determine in which sub-frame the glottal-shape codebook of the TC coding mode is used.

[0086] The information about the final position I_att,final of the detected attack is used to determine in which sub-frame of the current frame the glottal-shape codebook within the TC coding mode is employed and which TC mode configuration (see Reference [3]) is used. For example, in case of a frame of N = 256 samples which is divided into four (4) sub-frames and N/K = 32 analysis segments, the glottal-shape codebook is used in the first sub-frame if the final attack position I_att,final is detected in segments 1-7, in the second sub-frame if the final attack position I_att,final is detected in segments 8-15, in the third sub-frame if the final attack position I_att,final is detected in segments 16-23, and finally in the last (fourth) sub-frame of the current frame if the final attack position I_att,final is detected in segments 24-31. The value I_att,final = 0 signals that an attack was not found and that the current frame is coded according to the original classification (usually using the GC coding mode).
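The final decision and the resulting choice of sub-frame for the four sub-frame case may be sketched as follows. The function names, and the use of `None` to signal that no attack was found, are assumptions of this sketch.

```python
def final_attack_position(i_att1, i_att2, p=24):
    """Combine the two stages: a first-stage attack counts only in the last
    sub-frame (position >= p); otherwise the second-stage result is used."""
    return i_att1 if i_att1 >= p else i_att2


def glottal_subframe(i_att_final, segs_per_subframe=8):
    """Return the index (0..3) of the sub-frame that uses the glottal-shape
    codebook, or None when no attack was found (i_att_final == 0)."""
    if i_att_final == 0:
        return None
    return i_att_final // segs_per_subframe


# Hypothetical outcomes of the two detection stages.
from_first_stage = final_attack_position(i_att1=26, i_att2=0)   # last sub-frame
from_second_stage = final_attack_position(i_att1=0, i_att2=10)  # second sub-frame
no_attack = final_attack_position(i_att1=0, i_att2=0)
```

With eight segments per sub-frame, positions 1-7, 8-15, 16-23 and 24-31 map to sub-frame indices 0, 1, 2 and 3 respectively, as listed in the text.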

Illustrative implementation in an immersive voice/audio codec

[0087] The attack detecting method 400 comprises a glottal-shape codebook assignment operation 445. To perform operation 445, the attack detector 450 comprises a glottal-shape codebook assignment module 485 to assign the glottal-shape codebook within the TC coding mode to a given sub-frame of the current frame consisting of four (4) sub-frames using the following logic of Equation (12):

$$sbfr = \begin{cases} 0 & \text{if } 0 < I_{att,final} < 8 \\ 1 & \text{if } 8 \le I_{att,final} < 16 \\ 2 & \text{if } 16 \le I_{att,final} < 24 \\ 3 & \text{if } 24 \le I_{att,final} < 32 \end{cases} \qquad (12)$$

where sbfr is the sub-frame index, sbfr = 0, ..., 3, where index 0 denotes the first sub-frame, index 1 denotes the second sub-frame, index 2 denotes the third sub-frame, and index 3 denotes the fourth sub-frame.

[0088] The foregoing description of a non-limitative example of implementation supposes a pre-processing module operating at an internal sampling rate of 12.8 kHz, having four (4) sub-frames and thus frames having a number of samples N = 256. If the core codec uses ACELP at the internal sampling rate of 12.8 kHz, the final attack position I_att,final is assigned to the sub-frame as defined in Equation (12). However, the situation is different when the core codec operates at a different internal sampling rate, for example at higher bit-rates (16.4 kbps and more in the case of EVS) where the internal sampling rate is 16 kHz. Given a frame length of 20 ms, the frame is composed in this case of 5 sub-frames and the length of such frame is N₁₆ = 320 samples. In this example of implementation, since the pre-processing classification and analysis might still be performed in the 12.8 kHz internal sampling rate domain, the glottal-shape codebook assignment module 485 selects, in the glottal-shape codebook assignment operation 445, the sub-frame to be coded using the glottal-shape codebook within the TC coding mode using the following logic of Equation (13):

$$sbfr = \left\lfloor \frac{5 \, I_{att,final}}{32} \right\rfloor \qquad (13)$$

where the operator ⌊x⌋ indicates the largest integer less than or equal to x. In the case of Equation (13), the range sbfr = 0, ..., 4 is different from Equation (12) while the number of analysis segments is the same as in Equation (12), i.e. N/K = 32. Thus the glottal-shape codebook is used in the first sub-frame if the final attack position I_att,final is detected in segments 1-6, in the second sub-frame if the final attack position I_att,final is detected in segments 7-12, in the third sub-frame if the final attack position I_att,final is detected in segments 13-19, in the fourth sub-frame if the final attack position I_att,final is detected in segments 20-25, and finally in the last (fifth) sub-frame of the current frame if the final attack position I_att,final is detected in segments 26-31.

[0089] Figure 5 is a graph of a first non-restrictive, illustrative example showing the impact of the attack detector of Figure 4 and TC coding mode on the quality of a decoded music signal. Specifically, in Figure 5, a music segment of castanets is shown, wherein curve a) represents the input (uncoded) music signal, curve b) represents a decoded reference signal synthesis when only the first-stage attack detection was employed, and curve c) represents the decoded improved synthesis when the whole first-stage and second-stage attack detections and coding using the TC coding mode are employed. Comparing curves b) and c), it can be seen that the attacks (low-to-high amplitude onsets such as 500 in Figure 5) in the synthesis of curve c) are reconstructed significantly more accurately, both in terms of preserving the energy and the sharpness of the castanets signal at the beginning of onsets.
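Returning to the five sub-frame assignment of paragraph [0088], the segment ranges listed there can be checked with a short sketch. The integer formula used below, the floor of 5 times the attack position divided by 32, is an assumption chosen because it reproduces exactly the listed ranges; the function name is hypothetical.

```python
def glottal_subframe_16k(i_att_final, n_seg=32, n_subframes=5):
    """Map an attack position found among n_seg analysis segments (12.8 kHz
    domain) to one of n_subframes sub-frames of the 16 kHz core.

    Returns None when no attack was found (i_att_final == 0).
    """
    if i_att_final == 0:
        return None
    return (n_subframes * i_att_final) // n_seg  # integer floor division


# Sub-frame index for every possible non-zero attack position 1..31.
mapping = {pos: glottal_subframe_16k(pos) for pos in range(1, 32)}
```

Because the five sub-frames do not align with the 32 analysis segments, the third sub-frame receives seven positions (13-19) while the others receive six.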

[0090] Figure 6 is a graph of a second non-restrictive, illustrative example showing the impact of the attack detector of Figure 4 and TC coding mode on the quality of a decoded speech signal, wherein curve a) represents an input (uncoded) speech signal, curve b) represents a decoded reference speech signal synthesis when an onset frame is coded using the GC coding mode, and curve c) represents a decoded improved speech signal synthesis when the whole first-stage and second-stage attack detection and coding using the TC coding mode are employed in the onset frame. Comparing curves b) and c), it can be seen that coding of the attacks (low-to-high amplitude onsets such as 600 in Figure 6) is improved when the attack detecting method 400 and attack detector 450 and the TC coding mode are employed in the onset frame. Moreover, the frame after onset is coded using the GC coding mode both in curves b) and c) and it can be seen that the coding quality of the frame after onset is also improved in curve c). This is because the adaptive codebook in the GC coding mode in the frame after onset takes advantage of the well-built excitation when the onset frame is coded using the TC coding mode.

[0091] Figure 7 is a simplified block diagram of an example configuration of hardware components forming the devices for detecting an attack in a sound signal to be coded and for coding the detected attack and implementing the methods for detecting an attack in a sound signal to be coded and for coding the detected attack.

[0092] The devices for detecting an attack in a sound signal to be coded and for coding the detected attack may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The devices for detecting an attack in a sound signal to be coded and for coding the detected attack (identified as 700 in Figure 7) comprise an input 702, an output 704, a processor 706 and a memory 708.

[0093] The input 702 is configured to receive, for example, the digital input sound signal 105 (Figure 1). The output 704 is configured to supply the encoded bit-stream 111. The input 702 and the output 704 may be implemented in a common module, for example a serial input/output device.

[0094] The processor 706 is operatively connected to the input 702, to the output 704, and to the memory 708. The processor 706 is realized as one or more processors for executing code instructions in support of the functions of the various modules of the sound encoder 106, including the modules of Figures 2, 3 and 4.

[0095] The memory 708 may comprise a non-transient memory for storing code instructions executable by the processor 706, specifically a processor-readable memory comprising non-transitory instructions that, when executed, cause a processor to implement the operations and modules of the sound encoder 106, including the operations and modules of Figures 2, 3 and 4. The memory 708 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 706.

[0096] Those of ordinary skill in the art will realize that the descriptions of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack are illustrative only and are not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack may be customized to offer valuable solutions to existing needs and problems related to allocation or distribution of bit-budget.

[0097] In the interest of clarity, not all of the routine features of the implementations of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.

[0098] In accordance with the present disclosure, the modules, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine, and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.

[0099] Modules of the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack as described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.

[00100] In the methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack as described herein, the various operations and sub-operations may be performed in various orders and some of the operations and sub-operations may be optional.

[00101] Although the present, foregoing disclosure is made by way of non-restrictive, illustrative embodiments, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.

REFERENCES

[00102] The following references are referred to in the present specification and the full contents thereof are incorporated herein by reference.

[1] V. Eksler, R. Salami, and M. Jelinek, "Efficient handling of mode switching and speech transitions in the EVS codec," in Proc. IEEE Int Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015.

[2] V. Eksler, M. Jelinek, and R. Salami, "Method and Device for the Encoding of Transition Frames in Speech and Audio," WIPO Patent Application No. WO/2008/049221, 24 Oct. 2006.

[3] V. Eksler and M. Jelinek, "Glottal-Shape Codebook to Improve Robustness of CELP Codecs," IEEE Trans on Audio, Speech and Language Processing, vol. 18, no. 6, pp. 1208-1217, Aug. 2010.

[4] 3GPP TS 26.445: "Codec for Enhanced Voice Services (EVS); Detailed Algorithmic Description".

As additional disclosure, the following is the pseudo-code of a non-limitative example of the disclosed attack detector implemented in an Immersive Voice and Audio Services (IVAS) codec®

The pseudo-code is based on EVS. New IVAS logic is highlighted in shaded background.

Claims

WHAT IS CLAIMED IS:

1. A device for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames, comprising:

a first-stage attack detector for detecting the attack in a last sub-frame of a current frame; and

a second-stage attack detector for detecting the attack in one of the sub-frames of the current frame, including the sub-frames preceding the last sub-frame.

2. An attack detecting device according to claim 1, comprising a decision module for determining that the current frame is an active frame previously classified to be coded using a generic coding mode, and for indicating that no attack is detected when the current frame is not determined as an active frame previously classified to be coded using a generic coding mode.

3. An attack detecting device according to claim 1 or 2, comprising:

a calculator of an energy of the sound signal in a plurality of analysis segments in the current frame; and

a finder of one of the analysis segments with maximum energy representing a candidate attack position to be validated by the first-stage and second-stage attack detectors.

4. An attack detecting device according to claim 3, wherein the first-stage attack detector comprises:

a calculator of a first average energy across the analysis segments before the last sub-frame in the current frame; and

a calculator of a second average energy across the analysis segments of the current frame starting with the analysis segment with maximum energy to the last analysis segment of the current frame.

5. An attack detecting device according to claim 4, wherein the first-stage attack detector comprises: a first comparator of a ratio between the first average energy and the second average energy to:

- a first threshold; or

- a second threshold when a classification of a previous frame is VOICED.

6. An attack detecting device according to claim 5, wherein the first-stage attack detector comprises, when the comparison by the first comparator indicates that a first-stage attack is detected:

a second comparator of a ratio between the energy of the analysis segment of maximum energy and the energy of other analysis segments of the current frame with a third threshold.

7. An attack detecting device according to claim 6, comprising, when the comparisons by the first and second comparators indicate that a first-stage attack position is the analysis segment with maximum energy representing a candidate attack position:

a decision module for determining if the first-stage attack position is equal to or larger than a number of analysis segments before the last sub-frame of the current frame and, if the first-stage attack position is equal to or larger than the number of analysis segments before the last sub-frame, determining the position of the detected attack as the first-stage attack position in the last sub-frame of the current frame.

8. An attack detecting device according to any one of claims 1 to 7, wherein the second-stage attack detector is used when no attack is detected by the first-stage attack detector.

9. An attack detecting device according to claim 8, comprising a decision module for determining if the current frame is classified as VOICED, and wherein the second-stage attack detector is used when the current frame is not classified as VOICED.

10. An attack detecting device according to claim 8 or 9, wherein the second-stage attack detector comprises a calculator of a mean energy of the sound signal across analysis segments before the analysis segment with maximum energy representing a candidate attack position.

11. An attack detecting device according to claim 10, wherein the analysis segments before the analysis segment with maximum energy representing a candidate attack position comprise analysis segments from a previous frame.

12. An attack detecting device according to claim 11, wherein the second-stage attack detector comprises:

a first comparator of a ratio between the energy of the analysis segment representing a candidate attack position and the calculated mean energy to:

- a first threshold; or

- a second threshold when a classification of the previous frame is UNVOICED.

13. An attack detecting device according to claim 12, wherein the second-stage attack detector comprises, when the comparison by the first comparator of the second-stage attack detector indicates that a second-stage attack is detected: a second comparator of a ratio between the energy of the analysis segment representing a candidate attack position and a long-term energy of the analysis segments to a third threshold.

14. An attack detecting device according to claim 13, wherein the second comparator of the second-stage attack detector detects no attack when an attack was detected in the previous frame.

15. An attack detecting device according to claim 13, comprising, when the comparisons by the first and second comparators of the second-stage attack detector indicate that a second-stage attack position is the analysis segment with maximum energy representing a candidate attack position:

a decision module for determining the position of the detected attack as the second-stage attack position.

16. A device for coding an attack in a sound signal, comprising:

the attack detecting device according to any one of claims 1 to 15; and

an encoder of the sub-frame comprising the detected attack using a coding mode with a non-predictive codebook.

17. An attack coding device according to claim 16, wherein the coding mode is a transition coding mode.

18. An attack coding device according to claim 17, wherein the non-predictive codebook is a glottal-shape codebook populated with glottal impulse shapes.

19. An attack coding device according to claim 17, wherein the attack detecting device determines the sub-frame coded with the transition coding mode based on the position of the detected attack.

20. A device for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames, comprising:

at least one processor; and

a memory coupled to the processor and comprising non-transitory instructions that when executed cause the processor to implement:

a first-stage attack detector for detecting the attack in a last sub-frame of a current frame; and

a second-stage attack detector for detecting the attack in a sub-frame of the current frame preceding the last sub-frame.

21. A device for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames, comprising:

at least one processor; and

a memory coupled to the processor and comprising non-transitory instructions that when executed cause the processor to:

detect, in a first-stage, the attack positioned in a last sub-frame of a current frame; and

detect, in a second-stage, the attack positioned in a sub-frame of the current frame preceding the last sub-frame.

22. A method for detecting an attack in a sound signal to be coded wherein the sound signal is processed in successive frames each including a number of sub-frames, comprising:

a first-stage attack detection for detecting the attack in a last sub-frame of a current frame; and

a second-stage attack detection for detecting the attack in one of the sub-frames of the current frame, including the sub-frames preceding the last sub-frame.

23. An attack detecting method according to claim 22, comprising determining that the current frame is an active frame previously classified to be coded using a generic coding mode, and indicating that no attack is detected when the current frame is not determined as an active frame previously classified to be coded using a generic coding mode.

24. An attack detecting method according to claim 22 or 23, comprising:

calculating an energy of the sound signal in a plurality of analysis segments in the current frame; and

finding one of the analysis segments with maximum energy representing a candidate attack position to be validated by the first-stage and second-stage attack detections.
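
By way of non-limiting illustration, the two steps of claim 24 may be sketched as follows. The function names, the equal-length partition of the frame, and the use of a sum of squared samples as the segment energy are assumptions made for this sketch only; the claim does not fix any of them.

```python
def segment_energies(frame, num_segments):
    """Energy of each analysis segment, taken here as the sum of squared samples.

    Assumption: the frame is split into num_segments equal-length segments;
    any trailing samples that do not fill a segment are ignored.
    """
    seg_len = len(frame) // num_segments
    return [
        sum(x * x for x in frame[i * seg_len:(i + 1) * seg_len])
        for i in range(num_segments)
    ]


def candidate_attack_position(energies):
    """Index of the analysis segment with maximum energy (first one on ties)."""
    return max(range(len(energies)), key=energies.__getitem__)
```

The returned index is only a candidate; it is the first-stage and second-stage detections that validate or reject it.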

25. An attack detecting method according to claim 24, wherein the first-stage attack detection comprises:

calculating a first average energy across the analysis segments before the last sub-frame in the current frame; and

calculating a second average energy across the analysis segments of the current frame starting with the analysis segment with maximum energy to the last analysis segment of the current frame.

26. An attack detecting method according to claim 25, wherein the first-stage attack detection comprises:

comparing, using a first comparator, a ratio between the first average energy and the second average energy to:

- a first threshold; or

- a second threshold when a classification of a previous frame is VOICED.

27. An attack detecting method according to claim 26, wherein the first-stage attack detection comprises, when the comparison by the first comparator indicates that a first-stage attack is detected:

comparing, using a second comparator, a ratio between the energy of the analysis segment of maximum energy and the energy of other analysis segments of the current frame with a third threshold.

28. An attack detecting method according to claim 27, comprising, when the comparisons by the first and second comparators indicate that a first-stage attack position is the analysis segment with maximum energy representing a candidate attack position:

determining if the first-stage attack position is equal to or larger than a number of analysis segments before the last sub-frame of the current frame and, if the first-stage attack position is equal to or larger than the number of analysis segments before the last sub-frame, determining the position of the detected attack as the first-stage attack position in the last sub-frame of the current frame.
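
By way of non-limiting illustration, the first-stage detection of claims 25 to 28 may be sketched as follows. The sketch rests on several assumptions that the claims leave open: the ratio is taken as the second average energy over the first, the "other analysis segments" of claim 27 are read as the segments before the last sub-frame, and all parameter names and threshold values are hypothetical.

```python
def first_stage_attack(energies, segs_before_last, prev_voiced,
                       thr_default, thr_prev_voiced, thr_peak):
    """Return the attack position in the last sub-frame, or None.

    energies         -- per-segment energies of the current frame
    segs_before_last -- number of analysis segments before the last sub-frame
    thr_*            -- hypothetical thresholds, not values from the claims
    """
    cand = max(range(len(energies)), key=energies.__getitem__)

    # Claim 25: first average before the last sub-frame, second average
    # from the candidate segment to the end of the frame.
    e1 = sum(energies[:segs_before_last]) / segs_before_last
    tail = energies[cand:]
    e2 = sum(tail) / len(tail)

    # Claim 26: first comparator, threshold chosen by previous-frame class.
    # Assumed direction: e2 / e1 > thr, written without a division.
    thr = thr_prev_voiced if prev_voiced else thr_default
    if e2 <= thr * e1:
        return None

    # Claim 27: second comparator, candidate energy against other segments.
    for i, e in enumerate(energies[:segs_before_last]):
        if i != cand and energies[cand] < thr_peak * e:
            return None

    # Claim 28: accept only a candidate located in the last sub-frame.
    return cand if cand >= segs_before_last else None
```

With eight segments, six of them before the last sub-frame, `first_stage_attack([1, 1, 1, 1, 1, 1, 50, 40], 6, False, 8.0, 20.0, 1.3)` accepts segment 6, whereas a flat energy profile is rejected by the first comparator.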

29. An attack detecting method according to any one of claims 22 to 28, wherein the second-stage attack detection is used when no attack is detected by the first-stage attack detection.

30. An attack detecting method according to claim 29, comprising determining if the current frame is classified as VOICED, wherein the second-stage attack detection is used when the current frame is not classified as VOICED.

31. An attack detecting method according to claim 29 or 30, wherein the second-stage attack detection comprises calculating a mean energy of the sound signal across analysis segments before the analysis segment with maximum energy representing a candidate attack position.

32. An attack detecting method according to claim 31, wherein the analysis segments before the analysis segment with maximum energy representing a candidate attack position comprise analysis segments from a previous frame.

33. An attack detecting method according to claim 32, wherein the second-stage attack detection comprises:

comparing, using a first comparator, a ratio between the energy of the analysis segment representing a candidate attack position and the calculated mean energy to:

- a first threshold; or

- a second threshold when a classification of the previous frame is UNVOICED.

34. An attack detecting method according to claim 33, wherein the second-stage attack detection comprises, when the comparison by the first comparator of the second-stage attack detection indicates that a second-stage attack is detected:

comparing, using a second comparator, a ratio between the energy of the analysis segment representing a candidate attack position and a long-term energy of the analysis segments to a third threshold.

35. An attack detecting method according to claim 34, wherein the comparison by the second comparator of the second-stage attack detection detects no attack when an attack was detected in the previous frame.

36. An attack detecting method according to claim 34, comprising, when the comparisons by the first and second comparators of the second-stage attack detection indicate that a second-stage attack position is the analysis segment with maximum energy representing a candidate attack position:

determining the position of the detected attack as the second-stage attack position.
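
By way of non-limiting illustration, the second-stage detection of claims 31 to 36 may be sketched as follows. The `lookback` window, the way the long-term energy is supplied, and all parameter names and threshold values are hypothetical; the claims only require that the mean be taken over segments before the candidate, including segments from the previous frame.

```python
def second_stage_attack(prev_energies, energies, cand,
                        prev_unvoiced, prev_attack, long_term_energy,
                        thr_default, thr_prev_unvoiced, thr_long_term,
                        lookback):
    """Return the second-stage attack position, or None.

    prev_energies    -- per-segment energies of the previous frame
    energies         -- per-segment energies of the current frame
    cand             -- candidate position (segment with maximum energy)
    long_term_energy -- long-term segment energy, maintained by the caller
    lookback         -- hypothetical number of segments averaged before cand
    """
    # Claims 31-32: mean energy before the candidate, reaching back into
    # the previous frame when the candidate is early in the current frame.
    history = list(prev_energies) + list(energies)
    idx = len(prev_energies) + cand
    before = history[max(0, idx - lookback):idx]
    if not before:
        return None
    mean_before = sum(before) / len(before)

    # Claim 33: first comparator, threshold chosen by previous-frame class.
    thr = thr_prev_unvoiced if prev_unvoiced else thr_default
    if energies[cand] <= thr * mean_before:
        return None

    # Claim 35: no attack when one was already detected in the previous frame.
    if prev_attack:
        return None

    # Claim 34: second comparator against the long-term energy.
    if energies[cand] <= thr_long_term * long_term_energy:
        return None

    # Claim 36: the candidate becomes the detected attack position.
    return cand
```

Multiplying the threshold by the reference energy, rather than dividing, avoids a division by zero when the preceding segments are silent.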

37. A method for coding an attack in a sound signal, comprising:

the attack detecting method according to any one of claims 22 to 36; and

encoding the sub-frame comprising the detected attack using a coding mode with a non-predictive codebook.

38. An attack coding method according to claim 37, wherein the coding mode is a transition coding mode.

39. An attack coding method according to claim 38, wherein the non-predictive codebook is a glottal-shape codebook populated with glottal impulse shapes.

40. An attack coding method according to claim 38, comprising determining the sub-frame coded with the transition coding mode based on the position of the detected attack.