WO2013017018A1

WO2013017018A1 - Method and apparatus for performing voice adaptive discontinuous transmission

Info

Publication number: WO2013017018A1
Application number: PCT/CN2012/078878
Authority: WO
Inventors: 顾彩霞; 袁浩; 江东平; 黎家力
Original assignee: 中兴通讯股份有限公司
Priority date: 2011-07-29
Filing date: 2012-07-19
Publication date: 2013-02-07
Also published as: CN102903364B; CN102903364A

Abstract

A method and an apparatus for performing voice adaptive discontinuous transmission. The method comprises: when voice adaptive discontinuous transmission is performed, determining whether to send a silence insertion descriptor frame according to frequency spectrum information of a current voice signal frame and frequency spectrum information of a previous silence insertion descriptor frame. This scheme may overcome the disadvantages of the related technologies: with a fixed interval, signal changes cannot be flexibly tracked, and with a variable interval, the required multi-parameter calculation, such as linear prediction, results in high computational complexity. The scheme operates directly in the frequency domain and tracks signal changes well, preserving sound quality while maintaining a low average bit rate.

Description

Method and device for performing speech adaptive discontinuous transmission

Technical field

The present invention relates to the field of digital signal processing, and in particular, to a method and apparatus for performing speech adaptive discontinuous transmission (DTX).

Background art

In actual user communication, only a small share of the time is typically spent transmitting the user's speech; most of it is spent transmitting non-speech background sound. Encoding the entire call with the method used for speech signals wastes considerable resources. To reduce this waste, in the related art the sender runs a Voice Activity Detector (VAD) algorithm on the signal. When an inactive segment of the call is detected, only the important information of the signal is encoded, at a lower bit rate: the signal is coded into Silence Insertion Descriptor (SID) frames, which are transmitted discontinuously. The decoder reconstructs the signal from the received SID frames by Comfort Noise Generation (CNG). This greatly reduces the average bit rate with little effect on sound quality, which helps make effective use of increasingly scarce network bandwidth. The strategy and interval used to send SID frames in the silence segment therefore determine how much bandwidth is saved.

Currently, there are two main methods for SID frame transmission in voice adaptive discontinuous transmission: (1) transmitting at a fixed interval, and (2) transmitting at a variable interval.

In the first mode, fixed-interval transmission, a SID frame is transmitted once every fixed number of frames in the silence segment according to a preset parameter. The 3GPP AMR and AMR-WB speech coding standards, for example, use this method, sending once every 8 frames. Its advantage is that the calculation is simple and easy to implement; its disadvantage is that the bit rate cannot be adjusted automatically according to the signal characteristics.

In the SID frame transmission mechanism of Adaptive Multi-Rate (AMR), when the sender detects a silence frame after a speech frame, it does not enter the silence segment immediately but applies a hangover mechanism. During this hangover phase, frames are still encoded as normal speech. If silence frames are still detected after the hangover phase, a SID_FIRST frame (i.e., the first SID frame) is sent at the first silence frame position of the silence segment, and a SID update (SID_UPDATE) frame is sent at the third silence frame position. After that, a SID update frame is sent every 7 frames, so that the parameters are updated at a fixed low bit rate after the hangover phase. In another implementation, when a silence frame is detected after N consecutive speech frames and N is less than 34, the hangover phase is skipped and the SID update frame is transmitted directly. This method is simple to calculate and can be implemented with only a counter; it requires no additional parameter calculation, the bit rate is controllable, and the algorithm is stable. Its disadvantage is that the fixed interval fixes the bit rate: the same rate is used for all kinds of noise and cannot be adjusted as the noise signal changes. For white noise, for example, the parameters are very stable, yet SID frames are still sent frequently, so the bit rate is not reduced effectively. For a fast-changing noise signal, the changes cannot be tracked in time; the resulting information delay causes large distortion of the noise signal when it is restored by CNG at the decoder.
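The counter-based nature of the fixed-interval scheme can be illustrated with a minimal Python sketch. It is not taken from the AMR specification. The function name, the `NO_DATA` label for frames in which nothing is sent, and the parameter names are assumptions for illustration; the default offset of 3 follows the "third silence frame position" above, and the default period of 8 follows the "once every 8 frames" figure quoted earlier.

```python
def fixed_interval_schedule(num_silence_frames, first_update_offset=3, update_period=8):
    """Label each silence frame that follows the hangover phase.

    Frame 0 carries SID_FIRST, frame `first_update_offset` carries the first
    SID_UPDATE, and a further SID_UPDATE follows every `update_period` frames.
    All other frames are labelled NO_DATA (hypothetical label: nothing is sent).
    """
    schedule = []
    for n in range(num_silence_frames):
        if n == 0:
            schedule.append("SID_FIRST")
        elif n >= first_update_offset and (n - first_update_offset) % update_period == 0:
            schedule.append("SID_UPDATE")
        else:
            schedule.append("NO_DATA")
    return schedule
```

Only a frame counter is needed, which is why the scheme is cheap; the schedule is also the same for white noise and for rapidly changing noise, which is the limitation described above.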

In the second mode, variable-interval transmission, an algorithm evaluates the signal of the silence segment in real time and decides from its real-time changes whether a SID frame needs to be transmitted. The advantage of this method is its flexibility: it adapts to real-time changes of the signal, saves as much bandwidth as possible, and adjusts the average bit rate. The disadvantage is that the calculation is relatively complicated.

The ITU-T G.729 speech coder uses the variable-interval mode. It decides whether an update is needed by computing parameters of the signal, such as the LPC coefficients, and measuring whether the signal has changed significantly. Although this method tracks the signal adaptively, its computational complexity is high. The method is based on linear prediction. First, linear predictive coding (LPC) analysis of the signal yields the linear prediction parameters a and the residual energy E. These are then compared, using a mathematical representation of the coefficients, with the same parameters of the last transmitted SID frame stored in memory. If either the LPC envelope comparison or the energy comparison exceeds a certain threshold, the signal is considered to have changed and a SID update frame is sent; otherwise none is sent. Because the method works in the time domain, LPC analysis of the signal must be performed first, which is computationally expensive. Moreover, how faithfully the LPC coefficients reflect the signal spectrum depends on the LPC order, and the LPC order is directly proportional to the computational complexity. In addition, the residual energy and the LPC envelope are checked separately, which makes it difficult to reflect the change of the signal as a whole. For example, if the LPC describes a frame inaccurately, the residual energy of the signal changes substantially as a direct result.

Summary of the invention

Embodiments of the present invention provide a method and apparatus for performing speech adaptive discontinuous transmission that overcome the drawbacks of the related art: the fixed-interval method cannot flexibly track signal changes, and the variable-interval method requires multi-parameter calculation such as linear prediction, which leads to high computational complexity.

In order to solve the above technical problem, an embodiment of the present invention provides a method for performing voice adaptive discontinuous transmission, including:

When performing speech adaptive discontinuous transmission, whether to send a silence insertion descriptor (SID) frame is determined according to the spectrum information of the current speech signal frame and of the previous SID frame.

The spectrum information of the speech signal frame is spectrum information calculated from the frequency-domain signal of the speech signal frame, or spectrum information calculated from that frequency-domain signal after smoothing.

The step of determining whether to send the SID frame according to the spectrum information of the current speech signal frame and of the previous SID frame includes:

sending a SID frame when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than a single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than a first preset limit.

Alternatively, the step of determining whether to send the SID frame according to the spectrum information of the current speech signal frame and of the previous SID frame includes:

when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, further determining whether that gap is greater than a second preset limit and, if so, sending two SID frames consecutively, wherein the spectral energy gap corresponding to the second preset limit is greater than the spectral energy gap corresponding to the first preset limit.

The gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame being greater than a preset limit means that:

the ratio of the spectral energy of the speech signal frame to the spectral energy of the previous SID frame is greater than a ratio threshold corresponding to the preset limit or less than the reciprocal of the ratio threshold, wherein the ratio threshold is a real number greater than 1;

or,

the absolute value of the difference between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than a difference threshold.

Alternatively, when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, a spectral correlation value between the spectral energy of the speech signal frame and that of the previous SID frame is calculated, and a SID frame is sent when the calculated spectral correlation value is less than a spectral correlation threshold.

In order to solve the above technical problem, an embodiment of the present invention further provides an apparatus for performing voice adaptive discontinuous transmission, including a silence insertion descriptor (SID) frame processing unit and a SID frame storage unit.

The SID frame processing unit is configured to determine whether to send a SID frame according to the spectrum information of the current speech signal frame and of the previous SID frame.

The SID frame storage unit is configured to store the spectrum information of the SID frame after the SID frame processing unit transmits the SID frame.

The SID frame processing unit is further configured to smooth the frequency-domain signal of the speech signal frame and to calculate the spectrum information of the speech signal frame from the smoothed frequency-domain signal.

The SID frame storage unit is further configured to store the smoothed frequency-domain signal.

The SID frame processing unit is configured to decide whether to transmit a SID frame as follows: when it determines that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, it sends a SID frame; or, when it determines that the same two conditions hold, it further determines whether the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than a second preset limit and, if so, sends two SID frames consecutively, wherein the spectral energy gap corresponding to the second preset limit is greater than the spectral energy gap corresponding to the first preset limit.

The gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame being greater than a preset limit means that: the ratio of the spectral energy of the speech signal frame to the spectral energy of the previous SID frame is greater than a ratio threshold corresponding to the preset limit or less than the reciprocal of the ratio threshold, wherein the ratio threshold is a real number greater than 1; or the absolute value of the difference between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than a difference threshold.

Alternatively, the SID frame processing unit is configured to decide whether to transmit a SID frame as follows: when it determines that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, it calculates a spectral correlation value between the spectral energy of the speech signal frame and that of the previous SID frame, and sends a SID frame when the calculated spectral correlation value is less than the spectral correlation threshold.

This solution overcomes the shortcomings of the related art, in which the fixed-interval method cannot flexibly track signal changes and the variable-interval method requires multi-parameter calculation such as linear prediction, resulting in high computational complexity. The scheme is implemented directly in the frequency domain, tracks signal changes well, and preserves sound quality while maintaining a low average bit rate.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of an apparatus for performing voice adaptive discontinuous transmission;

FIG. 2 is a schematic structural diagram of another apparatus for performing voice adaptive discontinuous transmission;

FIG. 3 is a schematic flowchart of performing voice adaptive discontinuous transmission in specific embodiment 2;

FIG. 4 is a schematic flowchart of performing voice adaptive discontinuous transmission in specific embodiment 3.

Preferred embodiment of the invention

As shown in FIG. 1, the apparatus for performing voice adaptive discontinuous transmission includes a silence insertion descriptor (SID) frame processing unit and a SID frame storage unit.

The SID frame processing unit is configured to determine whether to send a SID frame according to the spectrum information of the current speech signal frame and of the previous SID frame.

The SID frame storage unit is configured to store the spectrum information of the SID frame after the apparatus transmits the SID frame.

In Embodiment 1, the SID frame processing unit is configured to decide whether to send a SID frame as follows: when it determines that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, it sends a SID frame.

The SID frame processing unit may also be configured to decide whether to send a SID frame as follows: when it determines that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, it further determines whether that gap is greater than the second preset limit. If so, two SID frames are sent consecutively, wherein the spectral energy gap corresponding to the second preset limit is greater than the spectral energy gap corresponding to the first preset limit.

Here, the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame being greater than a preset limit means that:

the ratio of the spectral energy of the speech signal frame to the spectral energy of the previous SID frame is greater than a ratio threshold corresponding to the preset limit or less than the reciprocal of the ratio threshold, wherein the ratio threshold is a real number greater than 1; or the absolute value of the difference between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than a difference threshold.

In Embodiment 2, the SID frame processing unit is configured to decide whether to send a SID frame as follows: when it determines that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, it calculates a spectral correlation value from the spectral energy of the current speech signal frame and that of the previous SID frame, and sends a SID frame when the spectral correlation value is less than the spectral correlation threshold.

In Embodiment 3, the SID frame processing unit is configured to decide whether to send a SID frame using both the spectral energy gap between the two frames and the spectral correlation value.

As shown in FIG. 2, the apparatus may further include a smoothing filtering unit. The smoothing filtering unit is configured to smooth the frequency-domain signal of the speech signal and pass the result to the SID frame processing unit, which performs the above processing on the smoothed frequency-domain signal. The SID frame storage unit must then also store the smoothed frequency-domain signal.
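The smoothing filtering unit can be illustrated with a short Python sketch. It applies a symmetric 5-tap filter across the frequency bins, using the example coefficients [0.15, 0.15, 0.4, 0.15, 0.15] given in this description. The function name `smooth_spectrum` and the clamping of indices at the two ends of the spectrum are assumptions made for illustration; the description does not specify edge handling.

```python
def smooth_spectrum(spectrum, coeffs=(0.15, 0.15, 0.4, 0.15, 0.15)):
    """Smooth a magnitude spectrum with a symmetric FIR filter over frequency bins.

    Each output bin is a weighted sum of the bin itself and its neighbours.
    Bins beyond either end are replaced by the nearest edge bin (assumption).
    """
    half = len(coeffs) // 2
    last = len(spectrum) - 1
    smoothed = []
    for k in range(len(spectrum)):
        acc = 0.0
        for j, c in enumerate(coeffs):
            idx = min(max(k + j - half, 0), last)  # clamp to a valid bin index
            acc += c * spectrum[idx]
        smoothed.append(acc)
    return smoothed
```

Because the coefficients sum to 1, a flat spectrum passes through unchanged, while an isolated spike is spread over its neighbours and reduced in height.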

The method for performing voice adaptive discontinuous transmission includes: when performing voice adaptive discontinuous transmission, determining whether to send a silence insertion descriptor (SID) frame according to the spectrum information of the current voice signal frame and of the previous SID frame.

The purpose of the smoothing is to compare spectral changes of the signal more accurately: it reduces the influence of fine spectral detail on the overall comparison, removes spectral spikes and burrs, and makes the output spectrum smoother and the spectral envelope more stable. This spectral smoothing can be achieved with a smoothing filter. Take a 16 kHz sampling rate and a 20 ms frame length as an example. A fast Fourier transform (FFT) transforms the time-domain signal into the frequency domain to obtain the spectral parameters of the frame; the FFT length is 320 points. The following smoothing filter can be used:

$$H(z) = a_0 z^{-2} + a_1 z^{-1} + a_2 + a_3 z + a_4 z^{2}$$

where the coefficients $[a_0, a_1, a_2, a_3, a_4]$ are the smoothing coefficients, which can be [0.15, 0.15, 0.4, 0.15, 0.15]. After smoothing, the overall trend of the spectral lines is unchanged, but abrupt instantaneous changes are reduced, which makes changes in the signal envelope easier to observe. Spectral smoothing includes, but is not limited to, the filter-based approach described above. When a filter is used, different smoothing effects can be obtained by adjusting its coefficients or order.

In Embodiment 1, when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, a SID frame is sent.

Alternatively, when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, and that the gap between the spectral energy of the speech signal frame and the spectral energy of the previous SID frame is greater than the first preset limit, it is further determined whether that gap is greater than a second preset limit. If so, two SID frames are sent consecutively, wherein the spectral energy gap corresponding to the second preset limit is greater than the spectral energy gap corresponding to the first preset limit.

In Embodiment 2, when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold, a spectral correlation value is calculated from the spectral energy of the current speech signal frame and that of the previous SID frame, and a SID frame is sent when the spectral correlation value is less than the spectral correlation threshold.

In Embodiment 3, whether to send a SID frame may be decided using both the spectral energy gap between the two frames and the spectral correlation value.

The details will be described below by way of specific examples.

Specific embodiment 1

In this embodiment, the spectral correlation value is used for the decision.

After a SID frame is sent, the device stores the spectral energy information of that SID frame in the SID frame storage unit; that is, the SID frame storage unit always holds the spectral energy information of the last transmitted SID frame.

When determining whether to send a SID frame, it is first determined whether at least one of the absolute value of the spectral energy of the current speech signal frame and the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold (THR1). If this condition is not satisfied, the signal is considered to remain at low energy, and no SID frame needs to be transmitted. If the condition is satisfied, the correlation between the spectral energy of the current speech signal frame and the spectral energy of the previous SID frame is calculated according to the following formula:

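$$R_1 = \frac{\sum_{i=0}^{N-1} S(i)\, S_{last}(i)}{\sqrt{\sum_{i=0}^{N-1} S(i)^2 \, \sum_{i=0}^{N-1} S_{last}(i)^2}}$$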

where S(i) represents the spectral energy of the current speech signal frame, S_last(i) represents the spectral energy of the SID frame preceding the current frame, and N represents the spectral length, which is 320 in this embodiment.

If the absolute value of the spectral correlation value given by the above formula is less than the spectral correlation threshold (THR2), it is determined that a SID frame needs to be transmitted, and the information in the SID frame storage unit is updated at the same time.
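The decision in this embodiment can be sketched in Python as follows. The correlation formula is not legible in this text, so the sketch assumes a normalized cross-correlation between the two spectra, and it assumes that the frame energy compared against THR1 is the sum of squared spectral values. The function names and the parameters `thr1` and `thr2` (standing for THR1 and THR2) are illustrative only.

```python
import math


def spectral_correlation(cur, last):
    """Normalized cross-correlation of two spectra (assumed form of the measure)."""
    num = sum(c * l for c, l in zip(cur, last))
    den = math.sqrt(sum(c * c for c in cur) * sum(l * l for l in last))
    return num / den if den > 0.0 else 0.0


def should_send_sid_by_correlation(cur, last, thr1, thr2):
    """Return True when a SID frame should be sent.

    cur  -- spectrum S(i) of the current frame
    last -- stored spectrum S_last(i) of the previous SID frame
    """
    e_cur = sum(c * c for c in cur)      # frame energy (assumed definition)
    e_last = sum(l * l for l in last)
    if e_cur <= thr1 and e_last <= thr1:
        return False                     # both frames stay at low energy
    return abs(spectral_correlation(cur, last)) < thr2
```

When the function returns True, the caller would also overwrite the stored spectrum with `cur`, mirroring the update of the SID frame storage unit.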

Specific embodiment 2

In this embodiment, the ratio of the spectral energies is used for the decision.

After a SID frame is sent, the device stores the spectral energy information of that SID frame in the SID frame storage unit; that is, the SID frame storage unit always holds the spectral energy information of the last transmitted SID frame.

As shown in FIG. 3, when determining whether to send a SID frame, it is first determined whether at least one of the absolute value of the spectral energy of the current speech signal frame and the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold. If this condition is not met, the signal is considered to remain at low energy, and no SID frame needs to be transmitted. If the condition is met, the ratio of the spectral energy of the current speech signal frame to the spectral energy of the previous SID frame is calculated according to the following formula:

$$R_2 = \frac{\sum_{i=0}^{N-1} S(i)\, S(i)}{\sum_{i=0}^{N-1} S_{last}(i)\, S_{last}(i)}$$

where S(i) represents the spectral energy of the current speech signal frame, S_last(i) represents the spectral energy of the SID frame preceding the current frame, and N represents the spectral length.

If the ratio R_2 is greater than the threshold THR3 or less than the reciprocal of THR3, where THR3 is a real number greater than 1, the signal energy has changed substantially and a SID frame needs to be sent. Otherwise, no SID frame needs to be transmitted.
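The ratio test can be sketched in Python as below. The sketch assumes that the energy of a frame is the sum of squared spectral values, by analogy with the difference formula of specific embodiment 4; the function names, the `eps` guard against division by zero, and the parameters `thr1` and `thr3` (standing for the single-frame energy threshold and THR3) are illustrative assumptions.

```python
def frame_energy(spectrum):
    """Total spectral energy of one frame (assumed to be the sum of squares)."""
    return sum(s * s for s in spectrum)


def should_send_sid_by_ratio(cur, last, thr1, thr3, eps=1e-12):
    """Return True when the energy ratio R2 leaves the band [1/thr3, thr3]."""
    if thr3 <= 1.0:
        raise ValueError("thr3 must be a real number greater than 1")
    e_cur = frame_energy(cur)
    e_last = frame_energy(last)
    if e_cur <= thr1 and e_last <= thr1:
        return False                    # both frames stay at low energy
    r2 = e_cur / max(e_last, eps)       # eps avoids division by zero
    return r2 > thr3 or r2 < 1.0 / thr3
```

Testing against both THR3 and its reciprocal makes the decision symmetric: a rise in energy by a given factor and a fall by the same factor both trigger an update.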

Specific embodiment 3

In this embodiment, the ratio of the spectral energies is used for the decision, with two threshold levels.

As shown in FIG. 4, when determining whether to send a SID frame, it is first determined whether at least one of the absolute value of the spectral energy of the current speech signal frame and the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold. If this condition is not met, the signal is considered to remain at low energy, and no SID frame needs to be transmitted. If the condition is met, the ratio R_2 of the spectral energy of the current speech signal frame to the spectral energy of the previous SID frame is calculated with the same formula as in specific embodiment 2.

If the ratio R_2 is greater than the threshold THR3 or less than the reciprocal of THR3, where THR3 is a real number greater than 1, the signal energy has changed substantially and the next check is performed. Otherwise, no SID frame needs to be sent.

It is then determined whether the ratio R_2 is greater than the threshold THR4 or less than the reciprocal of THR4, where THR4 is a real number greater than THR3. If so, the signal energy has changed abruptly and very strongly (for example, very high-energy noise suddenly appears during silence); a continuous-update flag is set and two SID frames are forced to be sent consecutively. If this condition is not met, only one SID frame needs to be sent.
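The two-level decision can be sketched in Python as a function that returns how many SID frames to send. The function name, the `eps` guard, and the choice of passing precomputed frame energies are assumptions for illustration; `thr1`, `thr3` and `thr4` stand for the single-frame energy threshold, THR3 and THR4.

```python
def sid_frames_to_send(e_cur, e_last, thr1, thr3, thr4, eps=1e-12):
    """Return 0, 1 or 2: the number of SID frames to send for the current frame.

    e_cur  -- spectral energy of the current frame
    e_last -- spectral energy of the previous SID frame
    """
    if not (1.0 < thr3 < thr4):
        raise ValueError("thresholds must satisfy 1 < thr3 < thr4")
    if e_cur <= thr1 and e_last <= thr1:
        return 0                          # both frames stay at low energy
    r2 = e_cur / max(e_last, eps)         # eps avoids division by zero
    if not (r2 > thr3 or r2 < 1.0 / thr3):
        return 0                          # change too small to update
    if r2 > thr4 or r2 < 1.0 / thr4:
        return 2                          # abrupt change: send two frames in a row
    return 1
```

Sending two frames after an abrupt change is a redundancy measure: if one of them is lost, the decoder still receives the updated noise parameters.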

Specific embodiment 4

In this embodiment, the difference in spectral energy is used for the decision.

When determining whether to send a SID frame, it is first determined whether at least one of the absolute value of the spectral energy of the current speech signal frame and the absolute value of the spectral energy of the previous SID frame is greater than the single-frame energy threshold. If this condition is not met, the signal is considered to remain at low energy, and no SID frame needs to be transmitted. If the condition is met, the difference between the spectral energy of the current speech signal frame and the spectral energy of the previous SID frame is calculated according to the following formula:

$$R_3 = \sum_{i=0}^{N-1} S(i)\, S(i) - \sum_{i=0}^{N-1} S_{last}(i)\, S_{last}(i)$$

If the absolute value of the difference R_3 is greater than the threshold THR5, the signal energy has changed substantially; a SID frame needs to be sent, and the information in the SID frame storage unit is updated at the same time.
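The difference test, together with the update of the stored SID information, can be sketched as a small Python class. The class name, the initial stored energy of 0.0, and the use of the sum of squared spectral values as the frame energy are assumptions for illustration; `thr1` and `thr5` stand for the single-frame energy threshold and THR5.

```python
class SidDifferenceDetector:
    """Decides per frame whether to send a SID frame from the energy difference R3."""

    def __init__(self, thr1, thr5):
        self.thr1 = thr1
        self.thr5 = thr5
        self.last_energy = 0.0   # energy of the last transmitted SID frame (assumed start value)

    def process(self, spectrum):
        """Return True if a SID frame should be sent for this frame's spectrum."""
        e_cur = sum(s * s for s in spectrum)
        if e_cur <= self.thr1 and self.last_energy <= self.thr1:
            return False                     # both frames stay at low energy
        r3 = e_cur - self.last_energy
        if abs(r3) > self.thr5:
            self.last_energy = e_cur         # update the stored SID information
            return True
        return False
```

The stored energy changes only when a SID frame is actually sent, so slow drift accumulates against the last transmitted frame rather than against the immediately preceding frame.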

In the above scheme and specific embodiments, a hangover algorithm may be added to preserve sound quality at the end of speech and to complete the initialization of the CNG algorithm. That is, when a silence frame is detected after consecutive speech frames, the discontinuous transmission mode is not entered directly; instead, the first few silence frames continue to be processed as speech frames, and only then is the discontinuous transmission mode entered. For example, when the first silence frame is detected after a speech frame, the first 7 silence frames continue to be processed in speech frame mode. If silence frames are still detected after that, a SID_FIRST frame is transmitted, a SID_UPDATE frame is transmitted in the third frame after SID_FIRST, and subsequent SID frames are sent according to the decision algorithm described above. The hangover algorithm includes counting the consecutive speech frames. When the first silence frame is detected, if the count of consecutive speech frames is greater than the set hangover threshold (thr_hangover), the hangover phase is applied as described above; otherwise, SID_UPDATE is sent directly and the automatic detection state is entered, and the count of consecutive speech frames is cleared.
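The hangover logic can be sketched as a small state machine in Python. This is one possible reading of the paragraph above, not a definitive implementation: the class name, the `NO_DATA` label for the frames between SID_FIRST and SID_UPDATE, the `ADAPTIVE` label (meaning "defer to the spectrum-based decision"), and the clearing of the speech counter at the first silence frame in both branches are assumptions.

```python
class HangoverController:
    """Classifies each frame as SPEECH, SID_FIRST, SID_UPDATE, NO_DATA or ADAPTIVE."""

    def __init__(self, thr_hangover, hangover_frames=7, first_update_offset=3):
        self.thr_hangover = thr_hangover
        self.hangover_frames = hangover_frames
        self.first_update_offset = first_update_offset
        self.speech_run = 0       # consecutive speech frames counted so far
        self.silence_pos = 0      # index of the next frame within the silence segment
        self.use_hangover = False

    def step(self, is_speech):
        if is_speech:
            self.speech_run += 1
            self.silence_pos = 0
            return "SPEECH"
        if self.silence_pos == 0:
            # First silence frame: decide whether the hangover phase applies.
            self.use_hangover = self.speech_run > self.thr_hangover
            self.speech_run = 0
        pos = self.silence_pos
        self.silence_pos += 1
        if not self.use_hangover:
            return "SID_UPDATE" if pos == 0 else "ADAPTIVE"
        if pos < self.hangover_frames:
            return "SPEECH"       # hangover: still encoded as speech
        since_first = pos - self.hangover_frames
        if since_first == 0:
            return "SID_FIRST"
        if since_first < self.first_update_offset:
            return "NO_DATA"
        if since_first == self.first_update_offset:
            return "SID_UPDATE"
        return "ADAPTIVE"
```

A caller would feed the VAD decision for each frame into `step` and run the spectrum-based SID decision only for frames labelled `ADAPTIVE`.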

In the above scheme and in the specific embodiment, the maximum SID interval threshold value may also be set. When the current frame is judged, the interval between the current frame and the previous SID frame exceeds the maximum SID interval threshold, and the SID is forced to be updated to ensure the stability of the system and reduce the adverse effects caused by abnormal conditions such as SID frame loss.

In the above scheme and in the specific embodiments, a minimum SID interval threshold may also be set. When the current frame is evaluated, if the interval between the current frame and the previous SID frame is less than the minimum SID interval threshold, it is determined that no SID frame is sent and no update is made for the time being, so as to avoid sending SID frames too frequently.
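The two interval limits can be sketched as a gate wrapped around the spectral decision. This is a sketch under assumptions: the function name `gate_sid_decision` and its parameters are hypothetical, intervals are counted in frames, and the minimum-interval rule is read as suppressing an SID frame that would follow the previous one too closely, which is what its stated purpose of reducing frequent transmission suggests.

```python
def gate_sid_decision(frames_since_sid: int,
                      spectrum_says_send: bool,
                      min_interval: int,
                      max_interval: int) -> bool:
    """Apply the minimum and maximum SID interval thresholds.

    frames_since_sid   -- frames elapsed since the previous SID frame
    spectrum_says_send -- result of the spectral decision algorithm
    Assumes min_interval <= max_interval.
    """
    if frames_since_sid < min_interval:
        return False   # too soon after the last SID frame: hold the update
    if frames_since_sid > max_interval:
        return True    # too long without an update: force an SID frame
    return spectrum_says_send
```

Between the two limits the gate is transparent, so the spectral comparison alone decides whether an SID frame is sent.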

The solution can be used for real-time two-way communication, such as wireless communication, IP video conferencing, and other voice transmission applications, to save bandwidth and improve network utilization without substantially affecting sound quality. The scheme has low computational complexity, tracks changes in the signal spectrum accurately, remains effective when the noise changes quickly, saves bandwidth when the noise is stationary, and is independent of any specific speech or audio encoder, which makes it flexible and efficient.

It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments may be arbitrarily combined with each other. The invention may of course be embodied in various other forms, and modifications may be made, without departing from the spirit and scope of the invention. One of ordinary skill in the art can understand that all or part of the above steps can be completed by a program instructing related hardware, and the program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, all or part of the steps of the above embodiments may also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware or in the form of a software function module. The invention is not limited to any specific combination of hardware and software.

Industrial applicability

Claims

1. A method for performing speech adaptive discontinuous transmission, comprising:

2. The method of claim 1 wherein

3. The method according to claim 2, wherein the step of deciding whether to send the mute insertion description frame according to the current speech signal frame and the spectrum information of the previous mute insertion description frame comprises:

4. The method according to claim 2, wherein the step of deciding whether to send the mute insertion description frame according to the current speech signal frame and the spectrum information of the previous mute insertion description frame comprises:

5. The method according to claim 3 or 4, wherein the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame being greater than a preset limit means that:

the ratio of the spectral energy of the speech signal frame to the spectral energy of the previous mute insertion description frame is greater than a ratio threshold corresponding to the preset limit or less than the reciprocal of the ratio threshold, wherein the ratio threshold is a real number greater than 1;

or, the absolute value of the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame is greater than a difference threshold.

6. The method according to claim 2, wherein the step of determining whether to send the mute insertion description frame according to the current speech signal frame and the spectrum information of the previous mute insertion description frame comprises:

7. A device for performing speech adaptive discontinuous transmission, comprising: a mute insertion description frame processing unit and a mute insertion description frame storage unit; wherein

8. The apparatus according to claim 7, wherein

9. The apparatus according to claim 8, wherein the mute insertion description frame processing unit is configured to decide whether to transmit a mute insertion description frame by:

when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous mute insertion description frame is greater than a single-frame energy threshold, and that the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame is greater than a first preset limit, sending a mute insertion description frame; or, when it is determined that the absolute value of the spectral energy of the speech signal frame and/or the absolute value of the spectral energy of the previous mute insertion description frame is greater than the single-frame energy threshold, and that the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame is greater than the first preset limit, further determining whether the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame is greater than a second preset limit, and if so, continuously transmitting two mute insertion description frames, wherein the spectral energy difference corresponding to the second preset limit is greater than the spectral energy difference corresponding to the first preset limit;

wherein the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame being greater than a preset limit means that: the ratio of the spectral energy of the speech signal frame to the spectral energy of the previous mute insertion description frame is greater than a ratio threshold corresponding to the preset limit or less than the reciprocal of the ratio threshold, wherein the ratio threshold is a real number greater than 1; or the absolute value of the difference between the spectral energy of the speech signal frame and the spectral energy of the previous mute insertion description frame is greater than a difference threshold.

10. The apparatus according to claim 8, wherein the mute insertion description frame processing unit is configured to decide whether to transmit a mute insertion description frame by: