CN112489665A - Voice processing method and device and electronic equipment - Google Patents
- Publication number
- CN112489665A (application CN202011254361.0A / CN202011254361A)
- Authority
- CN
- China
- Prior art keywords
- frame
- current
- voice frame
- voice
- correct
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/004—Arrangements for detecting or preventing errors in the information received by using forward error control
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/004—Arrangements for detecting or preventing errors in the information received by using forward error control
- H04L1/0041—Arrangements at the transmitter end
- H04L1/0042—Encoding specially adapted to other signal generation operation, e.g. in order to reduce transmit distortions, jitter, or to improve signal shape
Abstract
The embodiment of the invention discloses a voice processing method and device and electronic equipment. The method comprises the following steps: when packet loss of a voice frame is detected, determining that voice frame to be the current voice frame and acquiring redundant information of a correct voice frame adjacent to the current voice frame; and decoding the current voice frame according to the redundant information. The redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct voice frame. The technical scheme of this embodiment solves the problems of high bandwidth consumption and easy network congestion caused by voice frame retransmission and forward-error-correction coded transmission; by adding a small amount of redundant information to some voice frames, it strengthens data recovery after packet loss while saving bandwidth, avoiding network congestion and improving voice quality.
Description
Technical Field
The present invention relates to audio processing technologies, and in particular, to a method and an apparatus for processing speech, and an electronic device.
Background
In an actual voice call, call quality is mainly affected by network packet loss: an unstable transmission network loses packets while voice information is being transmitted, making the voice choppy and discontinuous.
At present, retransmission can be used to recover data after packet loss, but it consumes extra bandwidth, easily causes network congestion, and recovers with difficulty once continuous packet loss occurs. Moreover, in real-time voice communication, a retransmitted packet that arrives beyond a certain delay is discarded anyway. Forward error correction avoids retransmitting data, but it adds computational overhead and complexity to encoding and decoding, trading processing power and bandwidth for reliability and lower recovery delay, and its performance drops markedly at higher packet loss rates. Error concealment techniques applied at the receiving end are easy to implement, but their concealment performance is poor and the resulting voice quality is low.
Disclosure of Invention
The embodiment of the invention provides a voice processing method and device and electronic equipment, which decode a voice frame that suffered packet loss according to its adjacent correct voice frame, thereby improving voice quality.
In a first aspect, an embodiment of the present invention provides a speech processing method, including:
when the packet loss of the voice frame is detected, determining the voice frame as a current voice frame and acquiring redundant information of a correct voice frame adjacent to the current voice frame;
decoding the current voice frame according to the redundant information;
wherein the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame.
In a second aspect, an embodiment of the present invention further provides a speech processing apparatus, including:
the redundant information determining module is used for, when packet loss of a voice frame is detected, determining the voice frame to be the current voice frame and acquiring redundant information of a correct voice frame adjacent to the current voice frame;
the voice decoding module is used for decoding the current voice frame according to the redundant information;
wherein the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the voice processing method according to any of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the voice processing method according to any one of the embodiments of the present invention.
According to the technical scheme of the embodiment of the invention, when packet loss of a voice frame is detected, the voice frame is taken as the current voice frame, the redundant information of the correct voice frame adjacent to it is acquired, and the current voice frame is decoded according to that redundant information. This solves the problems of high bandwidth consumption and easy network congestion caused by voice frame retransmission and forward-error-correction coded transmission; by adding a small amount of redundant information to some voice frames, it strengthens data recovery after packet loss while saving bandwidth, avoiding network congestion and improving voice quality.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, a brief description is given below of the drawings used in describing the embodiments. It should be clear that the described figures are only views of some of the embodiments of the invention to be described, not all, and that for a person skilled in the art, other figures can be derived from these figures without inventive effort.
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a speech processing method according to a second embodiment of the present invention;
fig. 3 is a schematic diagram illustrating sub-frame division of a second correct frame according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech processing apparatus according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a schematic flow chart of a speech processing method according to an embodiment of the present invention, where the embodiment is applicable to a situation where a lost speech frame is decoded by an adjacent correct speech frame of the lost speech frame when a speech frame is lost, and the method may be executed by a speech processing apparatus, and the apparatus may be implemented in a form of software and/or hardware.
As shown in fig. 1, the method of this embodiment specifically includes the following steps:
s110, when the voice frame packet loss is detected, the voice frame is determined to be the current voice frame, and the redundant information of the correct voice frame adjacent to the current voice frame is obtained.
Wherein, the correct speech frame adjacent to the current speech frame is the correct speech frame immediately before and/or after the current speech frame. The redundant information may include excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame. The preset transmission duration is a preconfigured duration for acquiring the redundant information; optionally, the duration of one or two speech frames is used as the preset transmission duration. It is understood that the preset transmission duration may also be set according to actual requirements, for example 20 ms; its specific value is not limited here. The excitation pulse parameters may be the position and amplitude parameters of the optimal excitation pulses. The optimal excitation pulses may be determined by a closed-loop search, such as analysis-by-synthesis, with the minimum perceptually weighted mean square error as the decision criterion. Perceptual weighting exploits the auditory masking effect of the human ear: in the higher-energy bands of the speech spectrum, i.e. at the formants, noise is less perceptible than in the lower-energy bands. The coding parameters may be understood as the parameters used by a codec when encoding a speech frame; optionally, the coding parameters of the embodiment of the present invention are the coding parameters required for decoding. Specifically, the redundant information of a speech frame may include the position and amplitude parameters of the optimal pulses within the preset transmission duration before that speech frame, together with the coding parameters needed to decode it.
Specifically, when a voice frame is received, it may be checked: if it is detected to be a complete voice frame, that is, one without packet loss, the voice frame and the coding parameters required for decoding it may be input to a decoder for decoding; if packet loss is detected, the voice frame is taken as the current voice frame, and the redundant information in the first correct voice frame after it is acquired so that packet loss recovery can subsequently be performed. The redundant information is taken from the first correct speech frame after the current speech frame because that frame carries the optimal excitation pulse parameters for the preset transmission duration preceding it; this parameter information relates to the current speech frame, so packet loss recovery can be performed on the current speech frame from it.
It should be noted that, if the time between the first correct speech frame after the current speech frame and the current speech frame exceeds a preset interval duration, the redundant information of that correct frame may give poor results when used to recover the current speech frame; in that case, the current speech frame may be recovered from an adjacent correct speech frame by techniques such as speech signal error concealment. The preset interval duration thus bounds how long the redundant information in the first correct speech frame after the current speech frame remains usable.
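For illustration, a minimal Python sketch of this receive-side decision flow follows. The Frame fields, the PRESET_GAP_FRAMES window and the returned path names are assumptions made for the sketch, not elements of the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    index: int
    lost: bool
    payload: bytes = b""
    redundancy: Optional[dict] = None  # excitation pulse + coding parameters, if carried

# assumed availability window: redundancy further away than this is treated as stale
PRESET_GAP_FRAMES = 2

def choose_recovery_path(current: Frame, later_frames: List[Frame]) -> str:
    """Pick the decoding path for a lost frame (sketch of S110/S120)."""
    for f in later_frames:
        if not f.lost and f.index - current.index <= PRESET_GAP_FRAMES:
            if f.redundancy is not None:
                # first correct frame carries usable redundancy for the lost frame
                return "decode_with_redundancy"
            # a correct frame arrived in time but carries no redundancy
            return "decode_with_adaptive_codebook"
    # no correct frame within the preset interval: fall back to error concealment
    return "error_concealment"

# usage: lost frame 1 followed by a correct frame 2 that carries redundancy
print(choose_recovery_path(Frame(1, True),
                           [Frame(2, False, b"ok", {"pulses": [(3, 0.9)]})]))
```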
And S120, decoding the current voice frame according to the redundant information.
Specifically, if the first correct speech frame after the current speech frame contains no redundant information, or its redundant information has been lost, the adaptive codebook generated by packet loss recovery is used for decoding. If it contains usable redundant information, the optimal excitation pulse parameters for the preset transmission duration before that frame can be determined from the redundant information; since those parameters carry information about the current speech frame, the current speech frame can be decoded from them. Concretely, the excitation pulse parameters corresponding to the redundant information of the first correct frame are determined and then used, in place of the periodic part of the excitation signal generated by the adaptive codebook of the current speech frame, to decode the current speech frame.
According to the technical scheme of the embodiment, when the voice frame packet loss is detected, the voice frame is used as the current voice frame, the redundant information of the correct voice frame adjacent to the current voice frame is obtained, and the current voice frame is decoded according to the redundant information, so that the technical problems of high bandwidth consumption and network congestion easily caused by voice frame retransmission and forward error correction coding transmission are solved, a small amount of redundant information is added into a part of voice frames, the data recovery after packet loss is enhanced, the bandwidth is saved, the network congestion is avoided, and the technical effect of voice quality is improved.
Example two
Fig. 2 is a schematic flow chart of a speech processing method according to a second embodiment of the present invention; this embodiment further refines the step of decoding the current speech frame according to the redundant information on the basis of the above embodiment. Explanations of terms that are the same as or correspond to those of the above embodiment are omitted here.
As shown in fig. 2, the method specifically includes the following steps:
s210, when the voice frame packet loss is detected, determining the voice frame as the current voice frame and acquiring the redundant information of the correct voice frame adjacent to the current voice frame.
When a voice frame is obtained, packet loss detection needs to be performed on it to ensure voice quality. If packet loss has occurred, the voice frame is determined to be the current voice frame, and the redundant information of a correct voice frame adjacent to it is acquired so that packet loss recovery can subsequently be performed on the current voice frame.
In order to enable the voice frame to carry the subsequent redundant information for packet loss recovery, the required redundant information may be added to the voice frame at the encoding end, and specifically, the steps of adding the redundant information to the voice frame and encoding the redundant information are as follows:
step one, detecting a voice frame to be transmitted according to a voice activity detection method and a fundamental tone detection method.
Voice Activity Detection (VAD) is used to detect whether a voice signal is present; it can distinguish silent segments from non-silent voice signals. Pitch detection is used to estimate the pitch, or fundamental frequency, of the periodic component of a speech signal; an aperiodic signal may be treated as a noise signal and a periodic signal as a non-noise signal. Further, a signal that is neither silence nor noise may be treated as a pitch signal. The speech frame to be transmitted is the original speech frame acquired by the encoding end.
If the voice frame is a mute signal or a noise signal, it indicates that no information with practical significance exists in the voice frame. In order to determine whether the speech frame contains a fundamental tone signal, the speech frame to be transmitted can be detected according to the two speech detection methods.
Specifically, detecting the speech frame to be transmitted may consist of performing VAD detection and pitch detection on it. To simplify the detection process, VAD detection may be applied first to decide whether the frame is a mute signal, and pitch detection is then applied to the non-mute frames to decide whether they contain a pitch signal.
And step two, if the voice frame to be transmitted contains fundamental tones, determining excitation pulse parameters and coding parameters in a preset transmission time length adjacent to the transmission time of the voice frame to be transmitted as redundant information of the voice frame to be transmitted, and coding the voice frame to be transmitted and the redundant information.
If detection shows that the speech frame to be transmitted contains a pitch, the frame carries speech information of practical significance, and redundant information can be attached to it so that packet loss recovery is possible if the frame is lost.
Specifically, the position and amplitude of an excited optimal pulse within a preset transmission duration before the transmission time of a speech frame to be transmitted are used as excitation pulse parameters, parameters required by a coder-decoder during speech frame coding and decoding are used as coding parameters, and the excitation pulse parameters and the coding parameters are determined as redundant information of the speech frame to be transmitted. Furthermore, the speech frame to be transmitted and the redundant information can be encoded by the encoder.
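As a rough sketch of this encoding step, the following Python code gates the redundancy on a toy energy-based VAD and an autocorrelation pitch check. The thresholds, the stand-in encoder and the packet layout are assumptions; a real implementation would use the codec's own VAD and pitch detector.

```python
from typing import Optional
import numpy as np

def is_speech(frame: np.ndarray, energy_thresh: float = 1e-4) -> bool:
    # toy VAD: mean energy above a threshold counts as non-silence (assumption)
    return float(np.mean(frame ** 2)) > energy_thresh

def detect_pitch(frame: np.ndarray, fs: int = 8000) -> Optional[int]:
    # toy autocorrelation pitch detector; returns lag in samples, or None for noise
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, "full")[len(frame) - 1:]
    lo, hi = fs // 400, fs // 60                   # search a 60-400 Hz pitch range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag if ac[lag] > 0.3 * ac[0] else None  # periodicity threshold (assumption)

def build_packet(frame: np.ndarray, prev_pulse_params, coding_params) -> dict:
    """Attach redundancy only to frames that contain a pitch."""
    packet = {"payload": frame.tobytes(), "redundancy": None}  # stand-in encoder
    if is_speech(frame) and detect_pitch(frame) is not None:
        packet["redundancy"] = {"pulses": prev_pulse_params,   # optimal pulses of the
                                "coding": coding_params}       # preceding duration
    return packet

# usage: a strongly periodic 20 ms frame at 8 kHz gets redundancy attached
t = np.arange(160) / 8000.0
pkt = build_packet(0.5 * np.sign(np.sin(2 * np.pi * 100 * t)), [(3, 0.9)], {"lsf": []})
print(pkt["redundancy"] is not None)
```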
S220, determining a correct voice frame adjacent to the current voice frame and located in a preset obtaining time length behind the current voice frame as a first correct frame.
The preset acquisition duration is the time to wait for the first correct frame after the current voice frame; specifically, it may be the longest allowed time between receiving the current speech frame and receiving the first correct frame. If the time between the first correct frame and the current speech frame exceeds the preset interval duration, the redundant information of the first correct frame may give poor results when used to recover the current speech frame; in that case, the current speech frame may be recovered from an adjacent correct speech frame by techniques such as speech signal error concealment. To guarantee the availability of the redundant information in the first correct frame, the preset acquisition duration should therefore be less than or equal to the preset interval duration.
Specifically, a first correct speech frame located within a preset acquisition duration after the current speech frame is used as a first correct frame for subsequent decoding of the current speech frame.
S230, determining whether the first correct frame includes redundant information, if so, performing S240, and if not, performing S250.
After the first correct frame is obtained, it is necessary to determine whether the first correct frame includes redundant information. If the first correct frame contains redundant information, it indicates that the current speech frame can be decoded according to the redundant information in the first correct frame to achieve the purpose of packet loss recovery. If the first correct frame does not contain redundant information, it indicates that the first correct frame may be a silent or noisy speech frame, or the redundant information in the first correct frame is missing, so that the current speech frame cannot be decoded according to the first correct frame, and the speech frame signal before the current speech frame can be used for decoding.
And S240, decoding the current voice frame according to the redundant information.
According to the redundant information in the first correct frame, the optimal pulse parameters of excitation in the preset transmission duration before the first correct frame can be determined, and because the parameters contain the relevant information of the current speech frame, the current speech frame can be decoded according to the parameter information.
It should be noted that the adaptive codebook is obtained by performing pitch analysis, for example cepstral analysis, on the speech frame. Each speech frame generates a corresponding adaptive codebook, whose parameters include the pitch lag and the pitch filter gain; the speech frame may be decoded from the parameter information in the adaptive codebook.
Specifically, an adaptive codebook for the current speech frame may be generated by packet loss recovery, and the excitation pulse parameters in the redundant information of the first correct frame may then be used, in place of the adaptive codebook, to generate the periodic part of the excitation signal of the current frame so as to decode the current speech frame.
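A minimal sketch of this substitution, assuming the past excitation history and the redundant pulse list are available as plain arrays (the exact form of the adaptive-codebook fallback is likewise an assumption):

```python
import numpy as np

def lost_frame_excitation(frame_len: int, past_exc: np.ndarray,
                          pitch_lag: int, acb_gain: float,
                          red_pulses=None) -> np.ndarray:
    """Periodic part of the excitation for a lost frame: redundant pulses when the
    first correct frame carried them, otherwise the adaptive-codebook repetition."""
    periodic = np.zeros(frame_len)
    if red_pulses:
        # place the optimal pulses (position, amplitude) from the redundancy
        for pos, amp in red_pulses:
            if 0 <= pos < frame_len:
                periodic[pos] = amp
    else:
        # adaptive codebook: repeat past excitation one pitch lag back
        # (requires len(past_exc) >= pitch_lag)
        for n in range(frame_len):
            periodic[n] = acb_gain * past_exc[(n % pitch_lag) - pitch_lag]
    return periodic

# usage: a 40-sample frame recovered from two redundant pulses
print(lost_frame_excitation(40, np.zeros(80), 32, 0.9,
                            red_pulses=[(5, 1.0), (37, -0.4)]))
```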
And S250, determining the correct voice frame adjacent to the current voice frame and before the current voice frame as a second correct frame.
Specifically, a first correct speech frame before the current speech frame is used as a second correct frame for subsequent decoding of the current speech frame.
S260, determining whether the second correct frame includes a pitch, if so, executing S270, and if not, executing S280.
After the second correct frame is obtained, it is necessary to determine whether the second correct frame includes a pitch. If the second correct frame contains the fundamental tone, the line spectrum frequency parameter of the current voice frame and the fundamental tone period of the current voice frame can be obtained according to the second correct frame so as to decode the current voice frame; if the second correct frame does not contain the fundamental tone, the line spectrum frequency parameter of the current speech frame and the self-adaptive codebook of the current speech frame can be obtained according to the first correct frame so as to decode the current speech frame.
S270, obtaining the line spectrum frequency parameter of the current voice frame and the pitch period of the current voice frame according to the second correct frame so as to decode the current voice frame.
If the second correct frame contains the fundamental tone, the speech frame before the current speech frame has speech information, and the line spectrum frequency parameter and the fundamental tone period of the current speech frame can be determined according to the second correct frame.
Optionally, the step of determining the line spectrum frequency parameter of the current speech frame according to the second correct frame is as follows:
in order to more accurately determine the line spectrum frequency parameter of the current speech frame, the second correct frame may be divided into a first sub-frame and a second sub-frame.
Specifically, the second correct frame is divided into two parts according to time, the former part of the second correct frame is used as the first subframe, and the latter part of the second correct frame is used as the second subframe, and a subframe division diagram of the second correct frame is shown in fig. 3.
Further, the line spectrum frequency parameter of the current voice frame is determined according to the line spectrum frequency parameter of the second subframe and the line spectrum frequency parameters of a preset number of voice frames adjacent to the current voice frame before the current voice frame.
The preset number is the preconfigured number of adjacent speech frames used to calculate the line spectrum frequency parameter of the current speech frame. The line spectrum frequency parameter of the current speech frame may be determined by weighted averaging over the line spectrum frequency parameter of the second subframe together with those of the preset number of speech frames preceding the current speech frame; alternatively, the parameters of the preceding frames may be weighted-averaged first, and the result then combined by weighted averaging with the parameter of the second subframe.
Alternatively, the line spectrum frequency parameter of the current frame may be determined using the following steps:
step one, determining a first part of the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the second subframe and a preset parameter.
Since the line spectrum frequency parameters used for encoding a speech frame are quantized, they must first be inverse-quantized before use. Optionally, line spectrum frequency parameters of a preset dimension, such as 10-dimensional parameters, can be extracted according to the method described in the U.S. government 2400 bps mixed excitation linear prediction (MELP) speech coding standard.
Specifically, the first part of the line spectrum frequency parameter of the current speech frame may be determined according to the following formula:
lsf_e_1(i)=β×last_lsf_2(i)
wherein lsf_e_1(i) represents the first part of the i-th dimension line spectrum frequency parameter of the current speech frame, last_lsf_2(i) represents the i-th dimension line spectrum frequency parameter of the second subframe of the second correct frame after inverse quantization, and β is a preset parameter that is fixed in each calculation.
The first part of the line spectral frequency parameter for each dimension of the current speech frame may be determined according to the above formula.
And step two, determining a preset number of voice frames adjacent to the current voice frame before the current voice frame as reference voice frames.
In order to make the line spectrum frequency parameter of the current frame more accurate, the preset number may be set to at least two.
And thirdly, dividing each reference speech frame into a first reference subframe and a second reference subframe, and determining the difference value of the line spectrum frequency parameter of the first reference subframe and the line spectrum frequency parameter of the second reference subframe as a reference difference value.
Specifically, each reference speech frame is divided into two parts according to time, the former part of the reference speech frame is used as a first reference subframe, and the latter part of the reference speech frame is used as a second reference subframe. A specific division method may refer to a method of dividing the second correct frame into the first subframe and the second subframe.
The reference difference of each reference speech frame may be determined, dimension by dimension, according to the following formula:
dif_lsf=ref_lsf_1-ref_lsf_2
wherein dif_lsf represents the reference difference, and ref_lsf_1 and ref_lsf_2 represent the line spectrum frequency parameter of the first reference subframe and of the second reference subframe, respectively.
And step four, calculating a weighted average value of the at least two reference difference values, and determining a second part of the line spectrum frequency parameter of the current speech frame according to the weighted average value and a preset parameter.
According to the above steps, the number of reference difference values and the number of reference speech frames should be consistent.
Thus, the second part of the line spectrum frequency parameter of the current speech frame can be determined according to the following formula:
lsf_e_2(i) = β × (ρ_1 × dif_lsf(i,1) + ρ_2 × dif_lsf(i,2) + … + ρ_n × dif_lsf(i,n))
wherein lsf_e_2(i) represents the second part of the i-th dimension line spectrum frequency parameter of the current speech frame, β represents the preset parameter that is fixed in each calculation, dif_lsf(i,j) represents the j-th reference difference in the i-th dimension, ρ_j represents the weight of the j-th reference difference (its specific value can be set according to the actual situation), and n represents the total number of reference differences.
And step five, determining line spectrum frequency parameters of the current frame according to the first part and the second part.
Specifically, the line spectrum frequency parameter of the current frame can be determined by performing summation operation on the first part and the second part, and the formula of the line spectrum frequency parameter of the current frame is as follows:
lsf_e(i)=lsf_e_1(i)+lsf_e_2(i)
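As an illustration of steps one to five, the following Python sketch assembles lsf_e from its two parts according to the formulas above; the value of β, the weights ρ_j and the example vectors are assumptions chosen only so that the sketch runs.

```python
import numpy as np

BETA = 0.9  # assumed value of the fixed preset parameter beta

def estimate_lsf(last_lsf_2, ref_frames, weights):
    """lsf_e = lsf_e_1 + lsf_e_2 for a lost frame (steps one to five above).
    last_lsf_2: dequantized LSF vector of the second subframe of the second correct frame.
    ref_frames: per reference frame, a (first-subframe LSF, second-subframe LSF) pair.
    weights:    rho_j for each reference difference."""
    lsf_e_1 = BETA * np.asarray(last_lsf_2, dtype=float)
    # reference differences dif_lsf = ref_lsf_1 - ref_lsf_2, then their weighted average
    difs = [np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
            for a, b in ref_frames]
    lsf_e_2 = BETA * sum(w * d for w, d in zip(weights, difs))
    return lsf_e_1 + lsf_e_2

# usage: 3-dimensional LSFs, two reference frames weighted 0.6 / 0.4
print(estimate_lsf([0.30, 0.55, 0.80],
                   [([0.29, 0.54, 0.79], [0.31, 0.56, 0.81]),
                    ([0.28, 0.53, 0.78], [0.30, 0.55, 0.80])],
                   [0.6, 0.4]))
```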
the pitch period is an important parameter for describing the excitation source in the speech signal, and if the pitch period of the previous frame is directly copied, the adaptive capacity for the time-varying situation is lacked. Therefore, the pitch period of the current speech frame needs to be determined, and specifically, the pitch period of the current speech frame can be determined according to the pitch periods of the first subframe and the second subframe and the pitch period gains of the first subframe and the second subframe.
Alternatively, the pitch period of the current frame may be determined using the following steps:
and calculating the product of the pitch period of the first subframe and the pitch period gain of the first subframe to obtain a first sub-period. And calculating the product of the pitch period of the second subframe and the pitch period gain of the second subframe to obtain a second sub-period. And calculating the average value of the first sub-period and the second sub-period, and determining the average value as the pitch period of the current speech frame.
Specifically, the pitch period of the current speech frame can be calculated according to the following formula:
T = (T(1) × G(1) + T(2) × G(2)) / 2
wherein T is the pitch period of the current speech frame, T(1) and T(2) are the pitch periods of the first subframe and of the second subframe respectively, and G(1) and G(2) are the pitch period gains of the first subframe and of the second subframe respectively.
Optionally, if the current speech frame is divided into more than two subframes, the above formula may be generalized as follows:
T = (T(1) × G(1) + T(2) × G(2) + … + T(n) × G(n)) / n
wherein T is the pitch period of the current speech frame, T(i) is the pitch period of the i-th subframe, G(i) is the pitch period gain of the i-th subframe, and n is the number of subframes.
Optionally, when determining the pitch period of the current frame, the pitch period gain may be adjusted to make the pitch period better fit practical applications, for example: when T(i+1) is greater than T(i), G(i+1) is adjusted to 1.25 × G(i+1), i.e., G(i+1) is multiplied by a coefficient greater than 1. The specific method for adjusting the pitch period gain is not limited in this embodiment and may be set according to the actual situation.
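A short sketch of this pitch period estimate, including the gain adjustment just described (the 1.25 factor follows the example above; the rest is the gain-weighted average formula):

```python
def estimate_pitch_period(periods, gains):
    """T = (1/n) * sum_i T(i) * G(i), with G(i+1) boosted when the period grows."""
    gains = list(gains)
    for i in range(len(periods) - 1):
        if periods[i + 1] > periods[i]:
            gains[i + 1] *= 1.25  # favor the later, larger period (example adjustment)
    n = len(periods)
    return sum(t * g for t, g in zip(periods, gains)) / n

# usage: two subframes, second period longer so its gain is boosted
print(estimate_pitch_period([40, 44], [0.8, 0.8]))  # (40*0.8 + 44*1.0) / 2 = 38.0
```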
After determining the line spectrum frequency parameters of the current speech frame and the pitch period of the current speech frame, the current speech frame may be decoded.
Specifically, the determined line spectrum frequency parameter of the current speech frame, the pitch period of the current speech frame, and the coding parameter may be input to a decoder, so as to implement decoding of the current speech frame.
S280, acquiring a line spectrum frequency parameter of the current voice frame and an adaptive codebook of the current voice frame according to the first correct frame so as to decode the current voice frame.
If the second correct frame does not contain the fundamental tone, it indicates that the speech frame before the current speech frame is continuous noise or silence, and the line spectrum frequency parameter and the fundamental tone period of the current speech frame cannot be determined according to the second correct frame. At this time, the line spectrum frequency parameter of the current speech frame may be determined according to the first correct frame.
Optionally, the step of determining the line spectrum frequency parameter of the current speech frame according to the first correct frame is as follows:
and determining the line spectrum frequency parameter of the first correct frame according to the line spectrum frequency parameter of the decoding residual error of the first correct frame and the line spectrum frequency parameter of the decoding residual error of the current voice frame.
Specifically, the line spectrum frequency parameter of the first correct frame may be obtained according to the following formula:
lsf_r = α × resi_lsf + (1 − α) × mean_lsf
wherein lsf_r is the line spectrum frequency parameter of the first correct frame, resi_lsf is the decoding-residual line spectrum frequency parameter of the first correct frame, mean_lsf is the decoding-residual line spectrum frequency parameter of the current speech frame, and α is an adaptive parameter between 0 and 1 that can be adjusted according to the packet loss rate.
The decoding residual is the difference between successive speech frames and is generated by the encoder; at the decoding end, it can be read directly according to the preset coding rule. Optionally, line spectrum frequency parameters of a preset dimension, such as 10-dimensional parameters, can be extracted according to the method described in the U.S. government 2400 bps mixed excitation linear prediction (MELP) speech coding standard. The line spectrum frequency parameter of each dimension of the first correct frame can be determined using the above formula.
And determining the line spectrum frequency parameter of the current speech frame according to the line spectrum frequency parameter of the first correct frame and the noise line spectrum frequency parameter.
Specifically, the line spectrum frequency parameter of the current speech frame can be obtained according to the following formula:
lsf_e=lsf_r+lsf_rand
where lsf_e is the line spectrum frequency parameter of the current speech frame, and lsf_rand is a random noise line spectrum frequency parameter, which can be set in the range of 0-100.
Further, the adaptive codebook of the current speech frame is determined according to the adaptive codebook parameters of the first correct frame and the weighting parameters.
If it is detected that the second correct frame contains no pitch, the current speech frame is a lost frame that follows a silence or noise signal. In that case the adaptive codebook parameters are set to zero, and directly decoding the current speech frame with this adaptive codebook would introduce a large error and degrade the quality of the speech signal.
Moreover, since the pitch period of a speech signal is correlated from frame to frame, the adaptive codebook parameters of the current speech frame may be determined by nonlinear weighting based on the adaptive codebook parameters of the first correct frame. The specific nonlinear weighting method can be set according to the application scenario and the actual situation.
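The no-pitch recovery path can be sketched as follows. The blend of resi_lsf and mean_lsf, the uniform noise draw and the particular saturating weighting function are all assumptions made for the sketch, since the embodiment leaves the nonlinear weighting method open.

```python
import numpy as np

def no_pitch_recovery(resi_lsf, mean_lsf, acb_params_next,
                      alpha=0.5, noise_range=100.0, seed=0):
    """Recover LSF and adaptive-codebook parameters for a lost frame that follows
    silence or noise. alpha is the adaptive parameter (would track the loss rate)."""
    rng = np.random.default_rng(seed)
    resi = np.asarray(resi_lsf, dtype=float)
    mean = np.asarray(mean_lsf, dtype=float)
    # lsf_r from the decoding residual and the current-frame term, blended by alpha
    lsf_r = alpha * resi + (1.0 - alpha) * mean
    # lsf_e = lsf_r + lsf_rand, with the random term inside the stated 0-100 range
    lsf_e = lsf_r + rng.uniform(0.0, noise_range, size=lsf_r.shape)
    # nonlinear weighting of the first correct frame's adaptive-codebook parameters
    # (a saturating function chosen for illustration only)
    acb = {k: v * (1.0 - np.exp(-abs(v))) for k, v in acb_params_next.items()}
    return lsf_e, acb

# usage
lsf_e, acb = no_pitch_recovery([5.0, 10.0], [6.0, 9.0],
                               {"pitch_lag": 60.0, "gain": 0.7})
print(lsf_e, acb)
```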
After the line spectrum frequency parameter and the adaptive codebook of the current speech frame are obtained, the current speech frame can be decoded.
According to the technical scheme of this embodiment, the line spectrum frequency parameter and adaptive codebook parameters of the current speech frame are derived from the first correct frame, or the line spectrum frequency parameter and pitch period are derived from the second correct frame, and the current speech frame is decoded accordingly. This avoids both the continuous noise produced by decoding simply with the coding parameters of a silence or noise signal, and the noise and poor adaptability to time variation produced by decoding simply with the coding parameters of the second correct frame, thereby strengthening data recovery after packet loss, saving bandwidth and improving voice quality.
EXAMPLE III
Fig. 4 is a schematic structural diagram of a speech processing apparatus according to a third embodiment of the present invention, where the speech processing apparatus includes: a redundant information determination module 310 and a speech decoding module 320.
The redundant information determining module 310 is configured to, when packet loss of a voice frame is detected, determine the voice frame to be the current voice frame and acquire redundant information of a correct voice frame adjacent to the current voice frame; the speech decoding module 320 is configured to decode the current voice frame according to the redundant information; wherein the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame.
Optionally, the speech decoding module 320 includes:
the first correct frame determining submodule is used for determining a correct voice frame which is adjacent to the current voice frame and is positioned in a preset obtaining time length behind the current voice frame as a first correct frame;
and the first decoding submodule is used for decoding the current voice frame according to the redundant information if the first correct frame contains the redundant information.
Optionally, the speech decoding module 320 further includes:
a second correct frame determining submodule, configured to determine, if the first correct frame does not contain redundant information, a correct speech frame adjacent to and before the current speech frame as a second correct frame;
the second decoding submodule is used for decoding the current speech frame according to the pitch inclusion state of the second correct frame;
wherein the pitch inclusion state is either pitch included or pitch not included.
Optionally, if the pitch inclusion state is pitch not included, the second decoding submodule includes:
a first parameter determining unit, configured to determine a line spectrum frequency parameter of a first correct frame according to a line spectrum frequency parameter of a decoding residual of the first correct frame and a line spectrum frequency parameter of a decoding residual of a current speech frame;
the second parameter determining unit is used for determining the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the first correct frame and the noise line spectrum frequency parameter;
an adaptive codebook determining unit for determining an adaptive codebook of the current speech frame according to the adaptive codebook parameters of the first correct frame and the weighting parameters;
and the first decoding unit is used for decoding the current voice frame according to the line spectrum frequency parameter of the current voice frame and the self-adaptive codebook of the current voice frame.
Optionally, if the pitch inclusion state is pitch included, the second decoding submodule includes:
a subframe determination unit for dividing the second correct frame into a first subframe and a second subframe;
the line spectrum frequency parameter determining unit is used for determining the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the second subframe and the line spectrum frequency parameters of a preset number of voice frames adjacent to the current voice frame before the current voice frame;
a pitch period determining unit, configured to determine a pitch period of a current speech frame according to pitch periods of the first subframe and the second subframe and pitch period gains of the first subframe and the second subframe;
and the second decoding unit is used for decoding the current voice frame according to the line spectrum frequency parameter of the current voice frame and the pitch period of the current voice frame.
Optionally, the line spectrum frequency parameter determining unit includes:
the first part determining subunit is used for determining the first part of the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the second subframe and the preset parameter;
a reference speech frame determining subunit, configured to determine a preset number of speech frames adjacent to the current speech frame before the current speech frame as reference speech frames; wherein the preset number is at least two;
a reference difference determining subunit, configured to divide each reference speech frame into a first reference subframe and a second reference subframe, and determine a difference between a line spectrum frequency parameter of the first reference subframe and a line spectrum frequency parameter of the second reference subframe as a reference difference;
the second part determining subunit is used for calculating a weighted average value of the at least two reference difference values and determining a second part of the line spectrum frequency parameter of the current speech frame according to the weighted average value and preset parameters;
and the line spectrum frequency parameter determining subunit is used for determining the line spectrum frequency parameter of the current frame according to the first part and the second part.
Optionally, the pitch period determining unit includes:
a first sub-period calculating subunit, configured to calculate the product of the pitch period of the first subframe and the pitch period gain of the first subframe, to obtain a first sub-period;
a second sub-period calculating subunit, configured to calculate the product of the pitch period of the second subframe and the pitch period gain of the second subframe, to obtain a second sub-period;
and the pitch period determining subunit is used for calculating the average value of the first sub-period and the second sub-period, and determining the average value as the pitch period of the current voice frame.
Optionally, the apparatus further comprises:
the voice frame detection module is used for detecting a voice frame to be transmitted according to the voice activity detection method and the fundamental tone detection method;
and the coding module is used for determining the excitation pulse parameters and the coding parameters in the preset transmission duration adjacent to the transmission time of the voice frame to be transmitted as the redundant information of the voice frame to be transmitted and coding the voice frame to be transmitted and the redundant information if the voice frame to be transmitted contains fundamental tones.
According to the technical scheme of this embodiment, when packet loss of a voice frame is detected, the voice frame is taken as the current voice frame, the redundant information of the correct voice frame adjacent to it is acquired, and the current voice frame is decoded according to that redundant information. This solves the problems of high bandwidth consumption and easy network congestion caused by voice frame retransmission and forward-error-correction coded transmission; by adding a small amount of redundant information to some voice frames, it strengthens data recovery after packet loss while saving bandwidth, avoiding network congestion and improving voice quality.
The voice processing device provided by the embodiment of the invention can execute the voice processing method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, the units and modules included in the system are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
Example four
Fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary electronic device 40 suitable for implementing embodiments of the present invention. The electronic device 40 shown in fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in fig. 5, electronic device 40 is embodied in the form of a general purpose computing device. The components of electronic device 40 may include, but are not limited to: one or more processors or processing units 401, a system memory 402, and a bus 403 that couples the various system components (including the system memory 402 and the processing unit 401).
The system memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)404 and/or cache memory 405. The electronic device 40 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 406 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 403 by one or more data media interfaces. Memory 402 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 408 having a set (at least one) of program modules 407 may be stored, for example, in memory 402, such program modules 407 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 407 generally perform the functions and/or methods of the described embodiments of the invention.
The electronic device 40 may also communicate with one or more external devices 409 (e.g., keyboard, pointing device, display 410, etc.), with one or more devices that enable a user to interact with the electronic device 40, and/or with any devices (e.g., network card, modem, etc.) that enable the electronic device 40 to communicate with one or more other computing devices. Such communication may be through input/output (I/O) interface 411. Also, the electronic device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 412. As shown, the network adapter 412 communicates with the other modules of the electronic device 40 over the bus 403. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with electronic device 40, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 401 executes various functional applications and data processing, for example, implementing a voice processing method provided by an embodiment of the present invention, by running a program stored in the system memory 402.
EXAMPLE five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for speech processing, the method including:
when the packet loss of the voice frame is detected, determining the voice frame as the current voice frame and acquiring the redundant information of the correct voice frame adjacent to the current voice frame;
decoding the current voice frame according to the redundant information;
the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission time length adjacent to the transmission time of a correct speech frame.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the appended claims.
Claims (10)
1. A method of speech processing, comprising:
when packet loss of a voice frame is detected, determining the voice frame as a current voice frame and acquiring redundant information of a correct voice frame adjacent to the current voice frame;
decoding the current voice frame according to the redundant information;
wherein the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame.
2. The method of claim 1, wherein decoding the current speech frame according to the redundant information comprises:
determining a correct voice frame adjacent to the current voice frame and located within a preset acquisition duration after the current voice frame as a first correct frame;
and if the first correct frame contains redundant information, decoding the current speech frame according to the redundant information.
3. The method of claim 2, further comprising:
if the first correct frame does not contain redundant information, determining a correct voice frame adjacent to the current voice frame and positioned before the current voice frame as a second correct frame;
decoding the current speech frame according to the pitch inclusion state of the second correct frame;
wherein the pitch inclusion state is either pitch-included or pitch-not-included.
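Taken together, claims 2-5 describe a small decision flow for choosing the recovery path. A minimal sketch follows, with illustrative strategy labels that are not part of the claims:

```python
def choose_recovery_strategy(first_correct_has_redundancy: bool,
                             second_correct_has_pitch: bool) -> str:
    # Prefer forward redundancy from the first correct frame (claim 2);
    # otherwise fall back to the previous correct frame and branch on
    # its pitch inclusion state (claims 3-5).
    if first_correct_has_redundancy:
        return "decode from redundancy"   # claim 2
    if second_correct_has_pitch:
        return "pitch-based recovery"     # claim 5
    return "noise-weighted recovery"      # claim 4
```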
4. The method of claim 3, wherein, when the pitch inclusion state is pitch-not-included, decoding the current speech frame comprises:
determining the line spectrum frequency parameter of the first correct frame according to the line spectrum frequency parameter of the decoding residual of the first correct frame and the line spectrum frequency parameter of the decoding residual of the current voice frame;
determining the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the first correct frame and a noise line spectrum frequency parameter;
determining the adaptive codebook of the current voice frame according to the adaptive codebook parameters and the weighting parameters of the first correct frame;
and decoding the current voice frame according to the line spectrum frequency parameter of the current voice frame and the adaptive codebook of the current voice frame.
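A hedged sketch of this no-pitch branch follows. The claim only states which quantities each result is derived from, so the concrete combination rules below (elementwise sum, fixed mixing factor, scalar attenuation) are assumptions for illustration:

```python
import numpy as np

def recover_without_pitch(lsf_residual_first: np.ndarray,
                          lsf_residual_current: np.ndarray,
                          lsf_noise: np.ndarray,
                          acb_params_first: np.ndarray,
                          mix: float = 0.5,
                          weight: float = 0.9):
    # LSF of the first correct frame, derived from the two decoding residuals
    # (combination by sum is an assumption).
    lsf_first = lsf_residual_first + lsf_residual_current
    # LSF of the current (lost) frame: mix of the first correct frame's LSF
    # and a noise LSF (mixing factor is an assumption).
    lsf_current = mix * lsf_first + (1.0 - mix) * lsf_noise
    # Adaptive codebook of the current frame: the first correct frame's
    # adaptive codebook parameters scaled by a weighting parameter.
    acb_current = weight * acb_params_first
    return lsf_current, acb_current
```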
5. The method of claim 3, wherein, when the pitch inclusion state is pitch-included, decoding the current speech frame comprises:
dividing the second correct frame into a first subframe and a second subframe;
determining the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the second subframe and the line spectrum frequency parameters of a preset number of voice frames preceding and adjacent to the current voice frame;
determining the pitch period of the current voice frame according to the pitch periods of the first subframe and the second subframe and the pitch period gains of the first subframe and the second subframe;
and decoding the current voice frame according to the line spectrum frequency parameter of the current voice frame and the pitch period of the current voice frame.
6. The method of claim 5, wherein determining the line spectrum frequency parameter of the current speech frame according to the line spectrum frequency parameter of the second subframe and the line spectrum frequency parameters of the preset number of speech frames preceding and adjacent to the current speech frame comprises:
determining a first part of the line spectrum frequency parameter of the current voice frame according to the line spectrum frequency parameter of the second subframe and a preset parameter;
determining the preset number of voice frames preceding and adjacent to the current voice frame as reference voice frames, wherein the preset number is at least two;
dividing each reference voice frame into a first reference subframe and a second reference subframe, and determining the difference value of the line spectrum frequency parameter of the first reference subframe and the line spectrum frequency parameter of the second reference subframe as a reference difference value;
calculating a weighted average value of at least two reference difference values, and determining a second part of a line spectrum frequency parameter of the current speech frame according to the weighted average value and a preset parameter;
and determining the line spectrum frequency parameter of the current voice frame according to the first part and the second part.
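A minimal sketch of this two-part construction, assuming an elementwise sum of the parts, equal reference weights, and a single scalar as the "preset parameter"; none of these constants are specified by the claim:

```python
import numpy as np

def lsf_of_lost_frame(lsf_second_subframe: np.ndarray,
                      reference_subframe_lsfs: list,  # [(lsf_sub1, lsf_sub2), ...]
                      preset: float = 0.7) -> np.ndarray:
    # First part: the second subframe's LSF scaled by a preset parameter.
    part1 = preset * lsf_second_subframe
    # Reference differences: LSF(first sub) - LSF(second sub) per reference frame.
    diffs = np.stack([s1 - s2 for (s1, s2) in reference_subframe_lsfs])
    # Equal weights here; the patent leaves the weighting unspecified.
    weights = np.full(len(reference_subframe_lsfs), 1.0 / len(reference_subframe_lsfs))
    # Second part: weighted average of the differences, scaled by a preset parameter.
    part2 = (1.0 - preset) * np.average(diffs, axis=0, weights=weights)
    # Current frame's LSF determined from the two parts (sum assumed).
    return part1 + part2
```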
7. The method of claim 5, wherein determining the pitch period of the current speech frame based on the pitch periods of the first and second subframes and the pitch period gains of the first and second subframes comprises:
calculating the product of the pitch period of the first subframe and the pitch period gain of the first subframe to obtain a first sub-period;
calculating the product of the pitch period of the second subframe and the pitch period gain of the second subframe to obtain a second sub-period;
and calculating the average value of the first sub-period and the second sub-period, and determining the average value as the pitch period of the current voice frame.
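This step is fully specified arithmetic, so a short sketch follows directly from the claim (function and parameter names are illustrative):

```python
def pitch_period_of_lost_frame(pitch_sub1: float, gain_sub1: float,
                               pitch_sub2: float, gain_sub2: float) -> float:
    # First sub-period: pitch period of the first subframe times its gain.
    sub_period_1 = pitch_sub1 * gain_sub1
    # Second sub-period: pitch period of the second subframe times its gain.
    sub_period_2 = pitch_sub2 * gain_sub2
    # Pitch period of the current frame: average of the two sub-periods.
    return 0.5 * (sub_period_1 + sub_period_2)

# Example: 60 * 0.95 = 57.0 and 62 * 0.90 = 55.8, averaging to 56.4.
assert pitch_period_of_lost_frame(60, 0.95, 62, 0.90) == 56.4
```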
8. The method of claim 1, further comprising:
detecting a voice frame to be transmitted according to a voice activity detection method and a pitch detection method;
and if the voice frame to be transmitted contains a pitch, determining excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the voice frame to be transmitted as redundant information of the voice frame to be transmitted, and encoding the voice frame to be transmitted together with the redundant information.
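On the encoder side, a hedged sketch of this gating step, with hypothetical field names for the payload layout (the patent does not define a wire format):

```python
def build_payload(frame_bits: bytes,
                  is_speech: bool,
                  has_pitch: bool,
                  window_excitation_pulses: list,
                  window_coding_params: dict) -> dict:
    # Only frames that pass voice activity detection and contain a pitch
    # carry redundancy, which keeps overhead off unvoiced or silent frames.
    payload = {"frame": frame_bits}
    if is_speech and has_pitch:
        payload["redundancy"] = {
            "excitation_pulses": window_excitation_pulses,
            "coding_params": window_coding_params,
        }
    return payload
```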
9. A speech processing apparatus, comprising:
the redundant information determining module is configured to, when packet loss of a voice frame is detected, determine the voice frame as a current voice frame and acquire redundant information of a correct voice frame adjacent to the current voice frame;
the voice decoding module is configured to decode the current voice frame according to the redundant information;
wherein the redundant information comprises excitation pulse parameters and coding parameters within a preset transmission duration adjacent to the transmission time of the correct speech frame.
10. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech processing method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011254361.0A CN112489665B (en) | 2020-11-11 | 2020-11-11 | Voice processing method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112489665A true CN112489665A (en) | 2021-03-12 |
CN112489665B CN112489665B (en) | 2024-02-23 |
Family
ID=74929678
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011254361.0A Active CN112489665B (en) | 2020-11-11 | 2020-11-11 | Voice processing method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112489665B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1441950A (en) * | 2000-07-14 | 2003-09-10 | 康奈克森特系统公司 | Speech communication system and method for handling lost frames |
US20050228651A1 (en) * | 2004-03-31 | 2005-10-13 | Microsoft Corporation. | Robust real-time speech codec |
US20090248404A1 (en) * | 2006-07-12 | 2009-10-01 | Panasonic Corporation | Lost frame compensating method, audio encoding apparatus and audio decoding apparatus |
JP2008111991A (en) * | 2006-10-30 | 2008-05-15 | Ntt Docomo Inc | Encoder, decoder, encoding method and decoding method |
CN103714820A (en) * | 2013-12-27 | 2014-04-09 | 广州华多网络科技有限公司 | Packet loss hiding method and device of parameter domain |
EP3226424A1 (en) * | 2016-03-21 | 2017-10-04 | Electronics and Telecommunications Research Institute | Apparatus and method for variable channel coding |
CN107248411A (en) * | 2016-03-29 | 2017-10-13 | 华为技术有限公司 | Frame losing compensation deals method and apparatus |
CN106251875A (en) * | 2016-08-12 | 2016-12-21 | 广州市百果园网络科技有限公司 | The method of a kind of frame losing compensation and terminal |
CN108011686A (en) * | 2016-10-31 | 2018-05-08 | 腾讯科技(深圳)有限公司 | Information coded frame loss recovery method and apparatus |
CN109496333A (en) * | 2017-06-26 | 2019-03-19 | 华为技术有限公司 | A kind of frame losing compensation method and equipment |
CN110890945A (en) * | 2019-11-20 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Data transmission method, device, terminal and storage medium |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096670A (en) * | 2021-03-30 | 2021-07-09 | 北京字节跳动网络技术有限公司 | Audio data processing method, device, equipment and storage medium |
CN113096670B (en) * | 2021-03-30 | 2024-05-14 | 北京字节跳动网络技术有限公司 | Audio data processing method, device, equipment and storage medium |
WO2022228144A1 (en) * | 2021-04-30 | 2022-11-03 | 腾讯科技(深圳)有限公司 | Audio signal enhancement method and apparatus, computer device, storage medium, and computer program product |
CN113259063A (en) * | 2021-06-10 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Data processing method, data processing device, computer equipment and computer readable storage medium |
WO2023202250A1 (en) * | 2022-04-18 | 2023-10-26 | 腾讯科技(深圳)有限公司 | Audio transmission method and apparatus, terminal, storage medium and program product |
CN114582365A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112489665B (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112489665B (en) | Voice processing method and device and electronic equipment | |
US9047863B2 (en) | Systems, methods, apparatus, and computer-readable media for criticality threshold control | |
RU2419167C2 (en) | Systems, methods and device for restoring deleted frame | |
US9524721B2 (en) | Apparatus and method for concealing frame erasure and voice decoding apparatus and method using the same | |
EP2026330B1 (en) | Device and method for lost frame concealment | |
US9373342B2 (en) | System and method for speech enhancement on compressed speech | |
WO2016192410A1 (en) | Method and apparatus for audio signal enhancement | |
EP3175458B1 (en) | Estimation of background noise in audio signals | |
US9467790B2 (en) | Reverberation estimator | |
KR20030048067A (en) | Improved spectral parameter substitution for the frame error concealment in a speech decoder | |
JP2010503325A (en) | Packet-based echo cancellation and suppression | |
IL239718A (en) | Systems and methods of performing gain control | |
US9489958B2 (en) | System and method to reduce transmission bandwidth via improved discontinuous transmission | |
EP2608200B1 (en) | Estimation of speech energy based on code excited linear prediction (CELP) parameters extracted from a partially-decoded CELP-encoded bit stream | |
US20090055171A1 (en) | Buzz reduction for low-complexity frame erasure concealment | |
EP3079151A1 (en) | Audio encoder and method for encoding an audio signal | |
US9449607B2 (en) | Systems and methods for detecting overflow | |
EP3966818A1 (en) | Methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack | |
JPH0438169B2 (en) | ||
JPH07334195A (en) | Device for encoding sub-frame length variable voice | |
KR20000014008A (en) | Method for diminishing a fixed code book gain when a continuous frame error is generated at a codec |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||