CN104658539A - Transcoding method for code stream of voice coder - Google Patents


Info

Publication number: CN104658539A
Application number: CN201310598532.5A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 盖丽
Applicant / Current assignee: Dalian You Jia Software Science And Technology Ltd
Legal status: Pending

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a transcoding method for the code stream of a voice coder, belonging to the technical field of voice coding and decoding. A code stream A sent by communication network 1 passes through a bit stream parsing unit, a decoding unit, a parameter conversion unit, an encoding unit and a bit stream packaging unit to obtain a code stream B received by communication network 2, where communication networks 1 and 2 employ different speech coding standards.

Description

Transcoding method of code stream of voice encoder
Technical Field
The invention relates to a transcoding method of a code stream of a voice encoder, belonging to the technical field of voice encoding and decoding.
Background
Different communication networks often use different speech coding standards. To ensure interoperability, it is usually necessary to "transcode" between the different encoders when interconnecting the networks. Suppose communication network 1 uses a type-A speech codec and communication network 2 uses a type-B speech codec. The conventional approach is decode-then-encode (DTE) transcoding: the type-A decoder of network 1 decodes the received bit stream into a time-domain speech signal, the type-B encoder of network 2 re-encodes that signal, and the resulting bit stream is sent to network 2. This method has high computational complexity, long delay and a large memory footprint, and the two cascaded encode/decode passes lower the quality of the synthesized speech.
Disclosure of Invention
In view of the above problems, the invention provides a transcoding method for the code stream of a voice encoder.
A transcoding method of a code stream of a voice encoder is characterized in that: a code stream A sent by the communication network 1 passes through a bit stream analysis unit, a decoding unit, a parameter conversion unit, an encoding unit and a bit stream encapsulation unit to obtain a code stream B received by the communication network 2, wherein the communication networks 1 and 2 are communication networks using different voice coding standards, such as a wireless network using an AMR standard and an IP network using a G.729AB standard.
The technical scheme of the invention has the following beneficial effects:
(1) when transcoding the line spectrum pair coefficient, a large amount of voice data is trained by using a Support Vector Regression (SVR) algorithm in advance, so that a mapping model of the line spectrum pair coefficient of the transmitting end and the line spectrum pair coefficient of the receiving end is obtained. On the basis, the mapping from the input line spectrum pair coefficient to the output line spectrum pair coefficient is carried out, so that the conversion of the line spectrum pair coefficient is more accurate, and the quality of the synthesized voice is improved.
(2) The decoded pitch lag integer part T0 is used as the open-loop search result of the encoding end, so that when the closed-loop search is performed, the closed-loop search range can be limited according to the value of T0, thereby improving the quality of the synthesized voice and reducing the calculation amount.
(3) In the process of transcoding the silence insertion description frame, a method of energy parameter direct mapping is adopted, and the calculation of the energy of the silence insertion description frame is eliminated, so that the algorithm complexity is reduced, and the storage capacity is correspondingly reduced.
(4) The frame type information is extracted from the input bit stream, so that the frame type is not judged in the transcoding process, the frame type is directly converted into the frame type which is the same as the received frame type when the bit stream is output, and the synthetic voice quality of a receiving end is effectively improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a flowchart of a voice frame transcoding method of the present invention.
Fig. 3 is a flowchart of a method for transcoding parameters of a silence insertion description frame according to the present invention.
FIG. 4 compares the PESQ scores of the DTE method and the proposed transcoding method for AMR-to-G.729AB transcoding.
FIG. 5 compares the WMOPS complexity of the DTE method and the proposed transcoding method for AMR-to-G.729AB transcoding.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
as shown in fig. 1: a code stream A sent by a communication network 1 passes through a bit stream analysis unit, a decoding unit, a parameter conversion unit, an encoding unit and a bit stream packaging unit to obtain a code stream B received by a communication network 2, wherein the communication networks 1 and 2 are communication networks using different voice coding standards.
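As an illustration only, the five-unit chain of fig. 1 can be sketched as a simple pipeline. The function names and the dictionary-based frame representation below are hypothetical placeholders; a real system would wrap actual AMR and G.729AB codec operations.

```python
# Hypothetical sketch of the five-unit transcoding chain; every stage is a
# stub standing in for the real AMR/G.729AB operation it is named after.

def parse_bitstream(stream_a):           # bit stream parsing unit
    return {"frame_type": stream_a["frame_type"], "bits": stream_a["bits"]}

def decode(parsed):                      # decoding unit
    return {"frame_type": parsed["frame_type"], "params": list(parsed["bits"])}

def convert_parameters(decoded):         # parameter conversion unit
    # Placeholder mapping; the invention uses SVR-based LSP mapping etc.
    return {"frame_type": decoded["frame_type"],
            "params": [p * 2 for p in decoded["params"]]}

def encode(converted):                   # encoding unit (pass-through stub)
    return converted

def pack_bitstream(encoded):             # bit stream packaging unit
    return {"frame_type": encoded["frame_type"], "bits": encoded["params"]}

def transcode(stream_a):
    """Code stream A -> code stream B through the five units in order."""
    return pack_bitstream(encode(convert_parameters(decode(parse_bitstream(stream_a)))))
```

Note how the frame type flows through unchanged, matching beneficial effect (4): the output frame type is taken directly from the input bit stream.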
Here, a specific implementation of the invention is described by taking the parameter transcoding process from AMR to G.729AB as an example: the A coding standard is AMR, the B coding standard is G.729AB, communication network 1 is a wireless communication network, and communication network 2 is an IP network. The AMR frame length is 20 ms, the G.729AB frame length is 10 ms, both use a 5 ms subframe, and one AMR frame corresponds to two G.729AB frames. The specific transcoding scheme is as follows:
the bit stream analyzing unit is used for receiving AMR code streams sent by a wireless communication network, and comprises the following specific steps:
(1) According to the AMR frame structure, the frame type (SPEECH_GOOD, SPEECH_BAD, SID_FIRST, SID_UPDATE, SID_BAD, NO_DATA), the mode information (MR_4.75 kbps, MR_5.15 kbps, MR_5.9 kbps, MR_6.7 kbps, MR_7.4 kbps, MR_10.2 kbps, MR_12.2 kbps) and the parameter bits are extracted in sequence from the received AMR code stream.
(2) According to the frame structure of AMR, the parameter bits are converted into the parameter values after quantization coding, namely the line spectrum pair coefficient, the pitch delay, the nonzero pulse position and the sign and the gain of the fixed codebook of the speech frame, or the line spectrum pair coefficient and the speech energy of the silence insertion description frame.
(3) Judge whether the current frame is a speech frame (SPEECH_GOOD, SPEECH_BAD), a silence insertion description frame (SID_UPDATE, SID_BAD) or a non-transmission frame (SID_FIRST, NO_DATA) according to the frame type information.
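Step (3) is a direct mapping from the frame-type field to one of three frame classes; a minimal sketch (the function name is illustrative):

```python
def classify_frame(frame_type):
    """Map an AMR frame type to the three classes used in transcoding."""
    if frame_type in ("SPEECH_GOOD", "SPEECH_BAD"):
        return "speech"
    if frame_type in ("SID_UPDATE", "SID_BAD"):
        return "sid"              # silence insertion description frame
    if frame_type in ("SID_FIRST", "NO_DATA"):
        return "no_transmission"
    raise ValueError("unknown AMR frame type: " + frame_type)
```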
The decoding unit is used for decoding the parameter bits by the AMR decoder to obtain the voice parameter value and the synthesized voice, and comprises the following specific steps:
(1) if the current frame is a speech frame:
decoding the parameter values after the quantization coding by using an AMR decoder to obtain voice parameters, wherein the voice parameters comprise line spectrum pair coefficients, pitch delay, non-zero pulse positions and symbols of a fixed codebook, self-adaptive codebook gain and fixed codebook gain; speech reconstruction is performed from the above speech parameters using an AMR decoder to obtain reconstructed speech s' (n).
(2) If the current frame is a silence insertion description frame:
The AMR decoder is used to decode the quantized parameter values to obtain the line spectrum pair coefficients and the speech energy of the silence insertion description frame.
The parameter conversion unit is used for transcoding the voice parameters obtained by AMR decoding to obtain the voice parameters required by G.729AB quantization coding, and the specific steps are as follows:
(1) If the received AMR frame type is a speech frame (SPEECH_GOOD or SPEECH_BAD), the transcoding procedure is as shown in fig. 2:
(a) linear predictive analysis:
transcoding of line spectrum to coefficients involves off-line mapping model parameter acquisition and on-line parameter mapping.
The mapping model parameters are obtained as follows. First, a large amount (more than 10 hours) of speech data covering various speaker types (adult male, adult female, boy, girl, etc.) and languages (Chinese, English, French, etc.) is encoded with both the AMR and G.729AB encoders, yielding K groups and 2K groups of quantized line spectrum pair coefficients LSP_{AMR}(k, i) and LSP_{G.729AB}(2k, i), i = 1, \ldots, n, k = 1, \ldots, K, where n is the dimension of the line spectrum pair coefficient vector. The support vector regression algorithm is then used to compute the parameters w_i, b_i of the mapping model between LSP_{AMR} and LSP_{G.729AB}:

LSP_{G.729AB\_1}(i) = LSP_{G.729AB\_2}(i) = w_i^T LSP_{AMR}(i) + b_i,

where LSP_{G.729AB\_1}(i) and LSP_{G.729AB\_2}(i), i = 1, \ldots, n, are the line spectrum pair coefficients of the two G.729AB frames corresponding to the current AMR frame.
When transcoding, the AMR line spectrum pair coefficients LSP_{AMR}(i) are passed through the n mapping models

LSP_{G.729AB\_1}(i) = LSP_{G.729AB\_2}(i) = w_i^T LSP_{AMR}(i) + b_i, \quad i = 1, \ldots, n,

to compute LSP_{G.729AB}(i), i = 1, \ldots, n.
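The on-line mapping step is a componentwise affine map of the AMR LSP vector through the n trained models; a minimal sketch (the weights and biases passed in below are made-up illustration values, not trained parameters):

```python
def map_lsp(lsp_amr, W, b):
    """Apply the n mapping models: out[i] = w_i . lsp_amr + b_i.

    lsp_amr : AMR line spectrum pair coefficient vector of dimension n
    W       : n x n weight matrix, row i holding w_i
    b       : length-n bias vector
    Both G.729AB frames of the 20 ms AMR frame share the returned vector.
    """
    n = len(lsp_amr)
    return [sum(W[i][j] * lsp_amr[j] for j in range(n)) + b[i] for i in range(n)]
```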
The mapping model parameters w_i, b_i between LSP_{AMR} and LSP_{G.729AB} are computed with the support vector regression algorithm as follows.

Define the line spectrum pair coefficients of AMR frame k (G.729AB frame 2k) as the training data x and y, i.e. x(k, i) = LSP_{AMR}(k, i) and y(k, i) = LSP_{G.729AB}(2k, i), and fit the data \{x(k, i), y(k, i)\}, k = 1, \ldots, K, i = 1, \ldots, n, with n regression functions f_i(x) = w_i^T x + b_i.
Define the n mapping functions

f_i(x) = w_i^T x + b_i = \sum_{k=1}^{K} (\alpha_{ki}^{*} - \alpha_{ki})(x_{ki} \cdot x) + b_i^{*},

with

w_i = \sum_{k=1}^{K} (\alpha_{ki}^{*} - \alpha_{ki}) x_{ki}, \qquad b_i^{*} = \frac{1}{N} \sum_{j \in \{j \mid \alpha_{ji} > 0\}} \Big[ y_{ji} - \sum_{k=1}^{K} y_{ki} \alpha_{ki} (x_{ki} \cdot x_{ji}) \Big],

where \alpha_{ki} and \alpha_{ki}^{*} are the Lagrange factors. For a given i, the Lagrange factors are solved as follows:
Define the Lagrange function

G(w_i, \zeta_{ki}, \zeta_{ki}^{*}) = \frac{1}{2}\|w_i\|^2 + C \sum_{k=1}^{K} (\zeta_{ki} + \zeta_{ki}^{*}) - \sum_{k=1}^{K} \alpha_{ki} (\epsilon + \zeta_{ki} - y_{ki} + w_i \cdot x_{ki} + b_i) - \sum_{k=1}^{K} \alpha_{ki}^{*} (\epsilon + \zeta_{ki}^{*} + y_{ki} - w_i \cdot x_{ki} - b_i) - \sum_{k=1}^{K} (\eta_{ki} \zeta_{ki} + \eta_{ki}^{*} \zeta_{ki}^{*}),

where C is a constant with C > 0, \zeta_{ki} \ge 0 and \zeta_{ki}^{*} \ge 0 are the slack (relaxation) factors, and \epsilon is the fitting accuracy. Maximize the objective function:
W(\alpha_{ki}, \alpha_{ki}^{*}) = -\epsilon \sum_{k=1}^{K} (\alpha_{ki}^{*} + \alpha_{ki}) + \sum_{k=1}^{K} y_{ki} (\alpha_{ki}^{*} - \alpha_{ki}) - \frac{1}{2} \sum_{k,j=1}^{K} (\alpha_{ki}^{*} - \alpha_{ki})(\alpha_{ji}^{*} - \alpha_{ji})(x_{ki} \cdot x_{ji}),

where the Lagrange factors \alpha_{ki} and \alpha_{ki}^{*} satisfy \sum_{k=1}^{K} (\alpha_{ki} - \alpha_{ki}^{*}) = 0 and 0 \le \alpha_{ki}, \alpha_{ki}^{*} \le C, k = 1, \ldots, K. This constitutes a typical quadratic programming problem. Using the KKT conditions \alpha_{ki} = 0 \Rightarrow y_{ki} f(x_{ki}) \ge 1; \; 0 < \alpha_{ki} < C \Rightarrow y_{ki} f(x_{ki}) = 1; \; \alpha_{ki} = C \Rightarrow y_{ki} f(x_{ki}) \le 1, the quadratic programming problem is solved with the sequential minimal optimization (SMO) algorithm. The steps of the SMO algorithm are as follows:
(1) Initialize the Lagrange factors; typically take \alpha_{ki} = 0;
(2) Evaluate the KKT conditions on the training data and find a data point (x_{1i}, y_{1i}) that violates them; its Lagrange factor \alpha_{1i} is taken as one of the two Lagrange factors to be optimized;
(3) Find the data point (x_{2i}, y_{2i}) in the training data that satisfies \max |f_i(x_{1i}) - f_i(x_{2i}) + y_{2i} - y_{1i}|, and take its Lagrange factor as \alpha_{2i}. Once \alpha_{1i} and \alpha_{2i} are selected, the other Lagrange factors are held fixed, forming a minimum-scale quadratic programming problem, i.e. solving for the optimal \alpha_{1i}^{new} and \alpha_{2i}^{new};
(4) Solve the minimal quadratic programming problem:

K_{11} = x_{1i} \cdot x_{1i}, \qquad K_{22} = x_{2i} \cdot x_{2i}, \qquad K_{12} = x_{1i} \cdot x_{2i},

\alpha_{2i}^{new} = \alpha_{2i}^{old} - \frac{y_{2i}(E_1 - E_2)}{2K_{12} - K_{11} - K_{22}},

where E_{ki} = f_i^{old}(x_{ki}) - y_{ki} is the training error.
When y_{1i} \ne y_{2i}: L = \max(0, \alpha_{2i}^{old} - \alpha_{1i}^{old}), \quad H = \min(C, C + \alpha_{2i}^{old} - \alpha_{1i}^{old});

when y_{1i} = y_{2i}: L = \max(0, \alpha_{1i}^{old} + \alpha_{2i}^{old} - C), \quad H = \min(C, \alpha_{1i}^{old} + \alpha_{2i}^{old});

\alpha_{2i}^{new,clipped} = H \text{ if } \alpha_{2i}^{new} \ge H; \quad \alpha_{2i}^{new} \text{ if } L < \alpha_{2i}^{new} < H; \quad L \text{ if } \alpha_{2i}^{new} \le L;

\alpha_{1i}^{new} = \alpha_{1i}^{old} + y_{1i} y_{2i} (\alpha_{2i}^{old} - \alpha_{2i}^{new,clipped});

this yields a pair of new Lagrange factors \alpha_{1i}^{new} and \alpha_{2i}^{new,clipped}.
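Step (4) — the unconstrained update of \alpha_{2i}, its clipping to [L, H], and the compensating update of \alpha_{1i} — can be sketched as a direct transcription of the formulas. The scalar inputs are illustrative; real training data are LSP vectors:

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, x1, x2, C):
    """One SMO pair update: returns the new alpha1 and clipped alpha2."""
    K11, K22, K12 = x1 * x1, x2 * x2, x1 * x2
    denom = 2 * K12 - K11 - K22          # -(x1 - x2)^2, always <= 0
    a2_new = a2 - y2 * (E1 - E2) / denom
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    a2_clipped = min(max(a2_new, L), H)  # clip to the box [L, H]
    a1_new = a1 + y1 * y2 * (a2 - a2_clipped)
    return a1_new, a2_clipped
```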
(5) Check whether any data point still violates the KKT conditions; if so, return to step (2); otherwise the optimal solution of the whole problem has been obtained, and proceed to the next step.
(6) Obtain the regression function:

f_i(x) = w_i^T x + b_i = \sum_{k=1}^{K} (\alpha_{ki}^{*} - \alpha_{ki})(x_{ki} \cdot x) + b_i^{*},

where b_i^{*} = \frac{1}{N} \sum_{j \in \{j \mid \alpha_{ji} > 0\}} \Big[ y_{ji} - \sum_{k=1}^{K} y_{ki} \alpha_{ki} (x_{ki} \cdot x_{ji}) \Big] and N is the number of support vectors.
The AMR line spectrum pair coefficients are mapped through the relation obtained above to give the line spectrum pair coefficients of the corresponding G.729AB frames, i.e.

LSP_{G.729AB\_1} = LSP_{G.729AB\_2} = f(LSP_{AMR});

these are taken as the unquantized G.729AB line spectrum pair coefficients.
And converting the obtained unquantized line spectrum pair coefficient into a line spectrum frequency coefficient, quantizing and coding according to a G.729AB coding standard, and then transmitting the quantized and coded line spectrum frequency coefficient to the IP network.
According to the G.729AB coding standard, the mapped line spectrum pair coefficients are interpolated with those of the previous frame (or frames) to obtain the unquantized line spectrum pair coefficients of each subframe, from which the unquantized linear prediction coefficients A(z) of each subframe are computed. The mapped line spectrum pair coefficients are also quantized to obtain the quantized line spectrum pair coefficients of the current frame; these are interpolated with the quantized coefficients of the previous frame (or frames) to obtain the quantized line spectrum pair coefficients of each subframe, from which the quantized linear prediction coefficients A'(z) of each subframe are computed. The unquantized and quantized linear prediction coefficients are used, respectively, to compute the perceptual weighting filter W(z) = A(z/\gamma_1)/A(z/\gamma_2) (\gamma_1 and \gamma_2 are the perceptual weighting coefficients) and the coefficients of the synthesis filter 1/A'(z).
(b) Open-loop pitch search:
in a speech coding algorithm based on code excited linear prediction, pitch search is done in two steps. The first step is an open-loop pitch search, where the pitch period is estimated approximately, denoted as T _ op, in order to provide a coarse range for the closed-loop pitch search to reduce the amount of computation for the closed-loop pitch search. The second step is to perform a closed loop pitch search around T _ op.
In transcoding, the normal open-loop pitch search is omitted, and the decoded integer pitch delay T0 is used directly as the open-loop pitch search result T_op of the G.729AB encoder:

T\_op_{G.729AB\_1} = T\_op_{G.729AB\_2} = T0_{AMR},

where T\_op_{G.729AB\_1} and T\_op_{G.729AB\_2} denote the open-loop pitch delays of the two G.729AB frames corresponding to the current AMR frame.
(c) Computing the impulse response of the perceptual weighted synthesis filter and the target signal of the adaptive codebook search

The impulse response h(n) of the perceptual weighted synthesis filter H(z) = A(z/\gamma_1)/(A'(z) A(z/\gamma_2)) is needed for the adaptive and fixed codebook searches and is typically computed once per subframe. An impulse signal is passed through the filter A(z/\gamma_1) and then successively through 1/A'(z) and 1/A(z/\gamma_2) to obtain h(n).
The target signal x(n) of the adaptive codebook search is computed as follows. First, the residual signal res_{LP}(n) of the linear prediction filter is calculated:

res_{LP}(n) = s'(n) + \sum_{i=1}^{P} \hat{a}_i \, s'(n - i),

where s'(n) is the reconstructed speech obtained by decoding, \hat{a}_i are the quantized linear prediction coefficients, and P is the order of the linear prediction filter. The residual res_{LP}(n) is then passed through the perceptual weighted synthesis filter H(z), i.e. convolved with h(n), to obtain the target signal:

x(n) = res_{LP}(n) * h(n).
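The two steps above — forming the LP residual from the reconstructed speech and convolving it with h(n) — can be sketched in pure Python (illustrative only; samples before n = 0 are treated as zero):

```python
def lp_residual(s, a):
    """res_LP(n) = s'(n) + sum_{i=1..P} a_i * s'(n - i), with a = [a_1..a_P]."""
    P = len(a)
    return [s[n] + sum(a[i] * s[n - 1 - i] for i in range(P) if n - 1 - i >= 0)
            for n in range(len(s))]

def convolve(x, h):
    """Linear convolution x(n) * h(n), truncated to len(x) samples."""
    return [sum(x[k] * h[n - k] for k in range(n + 1) if n - k < len(h))
            for n in range(len(x))]

def target_signal(s_rec, a_hat, h):
    """x(n) = res_LP(n) * h(n): residual of the reconstructed speech
    filtered by the perceptual weighted synthesis filter's impulse response."""
    return convolve(lp_residual(s_rec, a_hat), h)
```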
(d) adaptive codebook search
The adaptive codebook search includes a closed loop pitch search and a calculation of an adaptive codebook vector.
The criterion of the closed-loop pitch search is to minimize the mean-square error between the reconstructed speech at the decoding end and that at the encoding end, i.e. to maximize R(k):

R(k) = \frac{\sum_{n=0}^{len-1} x(n) \, y_k(n)}{\sqrt{\sum_{n=0}^{len-1} y_k(n) \, y_k(n)}},

where x(n) is the target signal, y_k(n) is the past filtered excitation at delay k (the past excitation convolved with h(n)), and len is the subframe length.
When the closed-loop pitch search is performed, the search range is limited to the neighborhood of the preselected value T_op, and is determined from the decoded integer pitch delay T0 as

[T0 - g(T0), \; T0 + g(T0)],

where

g(T0) = 3 for 20 \le T0 \le 39; \quad g(T0) = 2 for 40 \le T0 \le 79; \quad g(T0) = 1 for 80 \le T0 \le 143.
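The range restriction can be written directly from g(T0):

```python
def closed_loop_range(T0):
    """Closed-loop pitch search range [T0 - g(T0), T0 + g(T0)] from the
    decoded integer pitch delay T0 (defined for 20 <= T0 <= 143)."""
    if 20 <= T0 <= 39:
        g = 3
    elif 40 <= T0 <= 79:
        g = 2
    elif 80 <= T0 <= 143:
        g = 1
    else:
        raise ValueError("T0 outside the supported pitch delay range")
    return T0 - g, T0 + g
```

Narrowing the search to at most 2g(T0) + 1 candidate delays is what yields the complexity reduction claimed in beneficial effect (2).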
The closed-loop pitch search within this limited range gives the optimal integer pitch delay k. If, according to the coding standard of the receiving end, the resolution of k falls in the range where fractional delays are used, the fractions around the optimal integer delay are tested: the normalized correlation R(k) is interpolated and its maximum is searched to obtain the fractional pitch period:

R(k)_t = \sum_{i=0}^{\epsilon} R(k - i) \, b_m(t + i \cdot \epsilon) + \sum_{i=0}^{\epsilon} R(k + 1 + i) \, b_m(\epsilon - t + i \cdot \epsilon),

where t = 0, 1, \ldots, \epsilon - 1, \epsilon is the reciprocal of the fractional delay resolution, and b_m are the interpolation filter coefficients.
After the pitch delay is determined, the adaptive codebook vector v(n) is computed by interpolating the past excitation u(n) at the given integer delay k and fraction t:

v(n) = \sum_{i=0}^{P} u(n - k + i) \, b_q(t + i \cdot \epsilon) + \sum_{i=0}^{P} u(n - k + 1 + i) \, b_q(\epsilon - t + i \cdot \epsilon),
After the adaptive codebook vector is determined, the adaptive codebook gain g_p can be calculated:
<math> <mrow> <msub> <mi>g</mi> <mi>p</mi> </msub> <mo>=</mo> <mfrac> <mrow> <munderover> <mi>&Sigma;</mi> <mrow> <mi>n</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <mi>len</mi> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <mi>x</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> <mi>y</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> </mrow> <mrow> <munderover> <mi>&Sigma;</mi> <mrow> <mi>n</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <mi>len</mi> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <mi>y</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> <mi>y</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> </mrow> </mfrac> <mo>;</mo> </mrow> </math>
where len is the subframe length, x(n) is the target signal of the adaptive codebook search, and y(n) = v(n) * h(n) is the filtered adaptive codebook vector, h(n) being the impulse response of the perceptual weighting synthesis filter H(z).
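The gain formula above is a plain correlation ratio over one subframe and can be sketched directly (an illustrative sketch; CELP codecs additionally clamp g_p to a bounded range, which is omitted here):

```python
def adaptive_codebook_gain(x, y):
    """g_p = <x, y> / <y, y> over one subframe, where x is the target
    signal and y = v * h is the filtered adaptive codebook vector."""
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sum(yi * yi for yi in y)
    return num / den if den > 0.0 else 0.0
```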
(e) Fixed codebook search
The fixed codebook vector can be expressed as:
<math> <mrow> <mi>c</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>=</mo> <msub> <mi>S</mi> <mn>1</mn> </msub> <mi>&delta;</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>-</mo> <msub> <mi>m</mi> <mn>1</mn> </msub> <mo>)</mo> </mrow> <mo>+</mo> <msub> <mi>S</mi> <mn>2</mn> </msub> <mi>&delta;</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>-</mo> <msub> <mi>m</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>+</mo> <msub> <mi>S</mi> <msub> <mi>N</mi> <mi>p</mi> </msub> </msub> <mi>&delta;</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>-</mo> <msub> <mi>m</mi> <msub> <mi>N</mi> <mi>p</mi> </msub> </msub> <mo>)</mo> </mrow> <mo>,</mo> <mi>n</mi> <mo>=</mo> <mn>0,1</mn> <mo>,</mo> <mi>&Lambda;</mi> <mo>,</mo> <mi>len</mi> <mo>-</mo> <mn>1</mn> <mo>;</mo> </mrow> </math>
where δ(n) is a unit pulse, N_p is the number of non-zero pulses in the fixed codebook vector, and len is the subframe length. m_1, m_2, …, m_{N_p} are the positions of the non-zero pulses, and S_1, S_2, …, S_{N_p} are the signs (+1 or −1) of the non-zero pulses at the corresponding positions; c(n) is a len-dimensional vector whose elements are all 0 except for the N_p non-zero pulses.
The fixed codebook search uses the criterion of minimizing the weighted mean square error between the weighted reconstructed speech s'_w(n) of the decoding side and the reconstructed speech of the encoding side, i.e., it determines the positions and signs of the non-zero pulses in the codebook vector.
The fixed codebook search first calculates the target signal
x_2(n) = x(n) − g_p·y(n), n = 0, 1, …, len−1;
where x(n) is the target signal of the adaptive codebook search, y(n) = v(n) * h(n) is the filtered adaptive codebook vector, g_p is the adaptive codebook gain, and len is the subframe length.
If c is a codebook vector, the search finds the codebook vector that maximizes

Q = (d^T c)² / (c^T Φ c)    (1)

where d is the correlation between x_2(n) and the impulse response h(n) of the perceptually weighted synthesis filter, Φ = H^T H is the autocorrelation matrix of h(n), and T denotes matrix transposition.
The elements of vector d are calculated as follows:
<math> <mrow> <mi>d</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mi>n</mi> </mrow> <mrow> <mi>len</mi> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <msub> <mi>x</mi> <mn>2</mn> </msub> <mrow> <mo>(</mo> <mi>i</mi> <mo>)</mo> </mrow> <mi>h</mi> <mrow> <mo>(</mo> <mi>i</mi> <mo>-</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>,</mo> <mi>n</mi> <mo>=</mo> <mn>0,1</mn> <mo>,</mo> <mi>&Lambda;</mi> <mo>,</mo> <mi>len</mi> <mo>-</mo> <mn>1</mn> </mrow> </math>
where len is the subframe length. The elements of the symmetric matrix Φ are calculated as φ(i, j) = Σ_{n=max(i,j)}^{len−1} h(n−i)·h(n−j), i, j = 0, 1, …, len−1.
The numerator term in formula (1) can be expressed as:
<math> <mrow> <mi>C</mi> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <msub> <mi>N</mi> <mi>p</mi> </msub> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <msub> <mi>S</mi> <mi>i</mi> </msub> <mi>d</mi> <mrow> <mo>(</mo> <msub> <mi>u</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </math>
where u_i is the position of the ith pulse, S_i is the sign of the ith pulse, and N_p is the number of non-zero pulses in the fixed codebook vector. The denominator in formula (1) is given by:
<math> <mrow> <msub> <mi>E</mi> <mi>D</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <msub> <mi>N</mi> <mi>p</mi> </msub> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <mi>&phi;</mi> <mrow> <mo>(</mo> <msub> <mi>u</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>u</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mn>2</mn> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>0</mn> </mrow> <mrow> <msub> <mi>N</mi> <mi>p</mi> </msub> <mo>-</mo> <mn>2</mn> </mrow> </munderover> <munderover> <mi>&Sigma;</mi> <mrow> <mi>j</mi> <mo>=</mo> <mi>i</mi> <mo>+</mo> <mn>1</mn> </mrow> <mrow> <msub> <mi>N</mi> <mi>p</mi> </msub> <mo>-</mo> <mn>1</mn> </mrow> </munderover> <msub> <mi>S</mi> <mi>i</mi> </msub> <msub> <mi>S</mi> <mi>j</mi> </msub> <mi>&phi;</mi> <mrow> <mo>(</mo> <msub> <mi>u</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>u</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow> </math>
The positions u_1, u_2, …, u_{N_p} that maximize formula (1) are the desired non-zero pulse positions.
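Evaluating formula (1) for a candidate pulse set follows directly from the expressions for C and E_D above. A minimal sketch (the d vector and φ matrix are assumed precomputed; names are illustrative):

```python
def codebook_merit(positions, signs, d, phi):
    """Q = C^2 / E_D for one candidate set of non-zero pulses, with
    C and E_D computed exactly as in the two sums above."""
    C = sum(s * d[u] for s, u in zip(signs, positions))
    # Diagonal terms of E_D.
    ED = sum(phi[u][u] for u in positions)
    # Cross terms, counted twice with the pulse signs.
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            ED += 2 * signs[i] * signs[j] * phi[positions[i]][positions[j]]
    return (C * C) / ED
```

The simplified transcoding search of the following paragraph would evaluate this merit only for position candidates near the pulse positions decoded from the AMR stream.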
At transcoding time, the non-zero pulse positions m_1, m_2, …, m_{N_p} of the fixed codebook vector obtained by AMR decoding are used to limit the search range of the G.729AB fixed codebook search, so that the fixed codebook search becomes a simplified search conducted near the positions m_1, m_2, …, m_{N_p}.
The fixed codebook gain is searched for by minimizing the weighted mean square error between the reconstructed speech decoded by the AMR decoder and the reconstructed speech of the G.729AB encoder, i.e., by minimizing the following equation:
E = ||x − g_p·y − g_c·z||² = x^T x + g_p²·y^T y + g_c²·z^T z − 2g_p·x^T y − 2g_c·x^T z + 2g_p·g_c·y^T z
where x is the target vector of the fixed codebook search, y is the adaptive codebook vector filtered signal, and z is the convolution of the fixed codebook vector with h (n):
<math> <mrow> <mi>z</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>0</mn> </mrow> <mi>n</mi> </munderover> <mi>c</mi> <mrow> <mo>(</mo> <mi>i</mi> <mo>)</mo> </mrow> <mi>h</mi> <mrow> <mo>(</mo> <mi>n</mi> <mo>-</mo> <mi>i</mi> <mo>)</mo> </mrow> <mo>,</mo> <mi>n</mi> <mo>=</mo> <mn>0</mn> <mo>,</mo> <mi>&Lambda;</mi> <mo>,</mo> <mi>len</mi> <mo>-</mo> <mn>1</mn> </mrow> </math>
where len is the subframe length.
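The expanded error above can be transcribed term by term; in practice the encoder would evaluate it over a quantized (g_p, g_c) gain table and keep the minimum. A minimal sketch of the error evaluation only (names illustrative):

```python
def weighted_error(x, y, z, gp, gc):
    """E = ||x - gp*y - gc*z||^2, expanded exactly as in the equation above."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return (dot(x, x) + gp * gp * dot(y, y) + gc * gc * dot(z, z)
            - 2 * gp * dot(x, y) - 2 * gc * dot(x, z)
            + 2 * gp * gc * dot(y, z))
```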
(2) If the received AMR frame type is a silence insertion description frame (SID_UPDATE or SID_BAD), the transcoding flow is as shown in Fig. 3.
(a) The line spectrum pair coefficients obtained by AMR decoding are taken as the line spectrum pair coefficients of the corresponding G.729AB frames:
LSP_G.729AB_1 = LSP_G.729AB_2 = LSP_AMR
where LSP_G.729AB_1 and LSP_G.729AB_2 are the line spectrum pair coefficients of the two G.729AB frames corresponding to the current AMR frame.
(b) The speech energy parameter obtained by AMR decoding is converted into the energy parameter of the corresponding G.729AB frames:
ener_G.729AB_1 = ener_G.729AB_2 = 1.09·ener_AMR + 981
where ener_G.729AB_1 and ener_G.729AB_2 are the energy parameters of the two G.729AB frames corresponding to the current AMR frame.
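Steps (a) and (b) are a direct parameter mapping with no re-estimation, which is the point of this branch of the transcoder. A minimal sketch (the dict layout and function name are illustrative, not from either standard):

```python
def sid_transcode(lsp_amr, ener_amr):
    """Map one AMR SID frame's parameters onto the two corresponding
    G.729AB frames: LSPs are copied, energy is mapped linearly as
    ener = 1.09 * ener_AMR + 981."""
    ener = 1.09 * ener_amr + 981
    frame = {"lsp": list(lsp_amr), "ener": ener}
    return frame, dict(frame)  # two identical G.729AB frames
```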
The coding unit quantizes and encodes the obtained parameters, with the following specific steps:
(1) If the current frame is a speech frame, the parameters comprise the line spectrum pair coefficients, the pitch delay, the non-zero pulse positions and signs of the fixed codebook, the adaptive codebook gain, and the fixed codebook gain; each parameter is quantized and encoded with the G.729AB encoder to obtain the information bits.
(2) If the current frame is a silence insertion description frame, the parameters are line spectrum pair coefficients and voice energy, and the parameters are quantized and coded according to a G.729AB coding standard to obtain information bits.
The bit stream packaging unit packages and outputs the parameter bits, mode information, and frame type, where the output frame type is assigned according to the received frame type. If the received AMR frame type is a speech frame (SPEECH_GOOD or SPEECH_BAD), the G.729AB frame type is assigned as a speech frame (RATE_8000); if the received AMR frame type is a silence insertion description frame (SID_UPDATE or SID_BAD), the G.729AB frame type is assigned as a silence insertion description frame (RATE_SID); if the received AMR frame type is a non-transmission frame (NO_DATA or SID_FIRST), the G.729AB frame type is assigned as a non-transmission frame (RATE_0).
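The frame-type assignment above is a fixed many-to-one mapping and can be sketched as a lookup table (the string constants mirror the names used in the text; representing them as strings is my choice for illustration):

```python
# AMR frame type -> G.729AB frame type, exactly as assigned above.
AMR_TO_G729AB = {
    "SPEECH_GOOD": "RATE_8000", "SPEECH_BAD": "RATE_8000",
    "SID_UPDATE":  "RATE_SID",  "SID_BAD":    "RATE_SID",
    "NO_DATA":     "RATE_0",    "SID_FIRST":  "RATE_0",
}

def map_frame_type(amr_type):
    """Assign the output G.729AB frame type from the received AMR frame type."""
    return AMR_TO_G729AB[amr_type]
```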
(1) When transcoding the line spectrum pair coefficient, a support vector regression algorithm is used in advance to train a large amount of voice data, so that a mapping model of the line spectrum pair coefficient of the sending end and the line spectrum pair coefficient of the receiving end is obtained. On the basis, the mapping from the input line spectrum pair coefficient to the output line spectrum pair coefficient is carried out, so that the conversion of the line spectrum pair coefficient is more accurate, and the quality of the synthesized voice is improved.
(2) The decoded pitch lag integer part T0 is used as the open-loop search result of the encoding end, so that when the closed-loop search is performed, the closed-loop search range can be limited according to the value of T0, thereby improving the quality of the synthesized voice and reducing the calculation amount.
(3) In the process of transcoding the silence insertion description frame, a method of energy parameter direct mapping is adopted, and the calculation of the energy of the silence insertion description frame is eliminated, so that the algorithm complexity is reduced, and the storage capacity is correspondingly reduced.
(4) The frame type information is extracted from the input bit stream, so that the frame type is not judged in the transcoding process, the frame type is directly converted into the frame type which is the same as the received frame type when the bit stream is output, and the synthetic voice quality of a receiving end is effectively improved.
Fig. 4 shows the voice quality results of the conventional DTE method and the parameter transcoding method provided by the present invention when AMR is transcoded to G.729AB. The bit streams obtained by encoding the AMR standard test sequences t10.pcm–t19.pcm at each rate are used as input; the encoded bit stream of each rate is converted into a G.729AB code stream using both the DTE method and the method of the present invention, the code streams are decoded with a G.729AB decoder to obtain synthesized speech, and PESQ objective speech quality evaluation is performed between the synthesized speech and the corresponding AMR standard test sequences (t10.pcm–t19.pcm).
Fig. 5 shows the comparison of computational complexity between the conventional DTE method and the parameter transcoding method of the present invention when AMR is transcoded to G.729AB. Multiple speech segments (see Fig. 4 for the speech files) are selected from the AMR standard test sequences for WMOPS testing, and the worst-case results for each mode are listed in the table.
The above description covers only preferred embodiments of the present invention, but the scope of the invention is not limited thereto. Any equivalent replacement or change of the technical solutions and inventive concepts herein that can readily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of the present invention.

Claims (6)

1. A transcoding method of a code stream of a voice encoder is characterized in that: a code stream A sent by a communication network 1 passes through a bit stream analysis unit, a decoding unit, a parameter conversion unit, an encoding unit and a bit stream packaging unit to obtain a code stream B received by a communication network 2, wherein the communication networks 1 and 2 are communication networks using different voice coding standards.
2. The method of claim 1, wherein the transcoding method comprises: the bit stream parsing unit is configured to receive the code stream A sent by the communication network 1, with the following specific steps:
(1) According to the frame structure of the A coding standard of the communication network 1, mode information, frame type information, and parameter bits are extracted from the corresponding bits of the input code stream A.
(2) The parameter bits are converted into the quantized speech parameter values according to the frame structure of the A coding standard of the communication network 1; the parameters of a speech frame comprise line spectrum pair coefficients, pitch delay, and the non-zero pulse positions, signs, and gains of the fixed codebook; the parameters of a silence insertion description frame are line spectrum pair coefficients and speech energy.
(3) Frame type information is extracted from the code stream A, and it is judged whether the received frame type is a speech frame, a non-transmission frame, or a silence insertion description frame.
3. The method of claim 1, wherein the transcoding method comprises: the decoding unit is used for decoding the parameter bits by the decoder A to obtain the voice parameter value and the synthesized voice, and comprises the following specific steps:
(1) If the received frame type is a silence insertion description frame, the speech parameter values are obtained by decoding according to the received parameter index values; the parameters are the line spectrum pair coefficients and the energy ener.
(2) If the received frame type is a speech frame, then:
(a) The speech parameter values are obtained by decoding according to the received parameter index values; the parameters comprise the line spectrum pair coefficients, the integer part T0 and fractional part T0_frac of the pitch delay, the non-zero pulse positions and signs of the fixed codebook, the quantized adaptive codebook gain g'_p, and the quantized fixed codebook gain g'_c,
(b) Based on the speech parameters, speech reconstruction is performed using the A coding standard of the communication network 1 to obtain reconstructed speech s' (n),
(c) After the reconstructed speech s'(n) is obtained, the post-processing of the A decoder is not performed.
4. The method of claim 1, wherein the transcoding method comprises: the parameter conversion unit is used for transcoding the decoded voice parameters to obtain the voice parameters required by the B coding standard quantization coding of the communication network 2, and comprises the following specific steps:
(1) If the received frame is a speech frame, the transcoding steps are as follows:
(a) linear predictive analysis:
Transcoding of the line spectrum pair coefficients involves off-line acquisition of the mapping model parameters and on-line parameter mapping.
First, more than 10 hours of speech data of various speaker types and various languages are encoded by the A and B encoders respectively to obtain K groups of quantized line spectrum pair coefficients; the speaker types include adult male, adult female, boy, and girl voices, and the languages include Chinese, English, French, Spanish, and Arabic: LSP_A(k, i) and LSP_B(k, i), k = 1, …, K, i = 1, …, n, where n is the dimension of the line spectrum pair coefficient vector. A support vector regression algorithm is used to calculate the parameters w_i, b_i, i = 1, …, n of the mapping models LSP_B(i) = w_i^T·LSP_A(i) + b_i between LSP_A and LSP_B. At transcoding time, the n mapping models LSP_B(i) = w_i^T·LSP_A(i) + b_i, i = 1, …, n are applied to the line spectrum pair coefficients LSP_A(i) of the A encoder to calculate LSP_B(i), i = 1, …, n;
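The on-line mapping step is cheap once the n models are trained off-line. A minimal sketch, reading each of the n models as a per-coefficient affine map (one plausible reading of the per-dimension training data described below; w and b would come from the off-line SVR training):

```python
def map_lsp(lsp_a, w, b):
    """Apply the n trained models LSP_B(i) = w_i * LSP_A(i) + b_i
    to one frame's decoded line spectrum pair coefficients."""
    return [wi * xi + bi for wi, xi, bi in zip(w, lsp_a, b)]
```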
The specific process of calculating the mapping model parameters w_i, b_i between LSP_A and LSP_B with the support vector regression (SVR) algorithm is as follows:
The line spectrum pair coefficients LSP_A and LSP_B of the kth speech frame are defined as training data x and y respectively, i.e., x(k, i) = LSP_A(k, i), y(k, i) = LSP_B(k, i); n regression functions f_i(x) = w_i^T·x + b_i are used to fit the data {x(k, i), y(k, i)}, k = 1, …, K, i = 1, …, n;
defining n mapping functions <math> <mrow> <msub> <mi>f</mi> <mi>i</mi> </msub> <mrow> <mo>(</mo> <mi>x</mi> <mo>)</mo> </mrow> <mo>=</mo> <msup> <msub> <mi>w</mi> <mi>i</mi> </msub> <mi>T</mi> </msup> <mi>x</mi> <mo>+</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>k</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>k</mi> </msub> <mo>)</mo> </mrow> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>k</mi> </msub> <mi>x</mi> <mo>)</mo> </mrow> <mo>+</mo> <msup> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>*</mo> </msup> <mo>,</mo> </mrow> </math>
<math> <mrow> <msub> <mi>w</mi> <mi>i</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>)</mo> </mrow> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>,</mo> <msubsup> <mi>b</mi> <mi>i</mi> <mo>*</mo> </msubsup> <mo>=</mo> <mfrac> <mn>1</mn> <mi>N</mi> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>j</mi> <mo>&Element;</mo> <mo>{</mo> <mi>j</mi> <mo>|</mo> <msub> <mi>&alpha;</mi> <mi>ji</mi> </msub> <mo>></mo> <mn>0</mn> <mo>}</mo> </mrow> </munder> <mo>[</mo> <msub> <mi>y</mi> <mi>ji</mi> </msub> <mo>-</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <msub> <mi>y</mi> <mi>ki</mi> </msub> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mi>ji</mi> </msub> <mo>)</mo> </mrow> <mo>]</mo> <mo>,</mo> </mrow> </math>
where α_ki and α_ki* are the Lagrange factors; for a given i, the Lagrange factors are solved as follows:
defining the Lagrange function:
<math> <mrow> <mfenced open='' close=''> <mtable> <mtr> <mtd> <mi>G</mi> <mrow> <mo>(</mo> <msub> <mi>w</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>&zeta;</mi> <mi>ki</mi> </msub> <mo>,</mo> <msubsup> <mi>&zeta;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> <mo>=</mo> <mfrac> <mn>1</mn> <mn>2</mn> </mfrac> <msup> <mrow> <mo>|</mo> <mo>|</mo> <mi>w</mi> <mo>|</mo> <mo>|</mo> </mrow> <mn>2</mn> </msup> <mo>+</mo> <mi>C</mi> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>r</mi> </munderover> <mrow> <mo>(</mo> <msub> <mi>&zeta;</mi> <mi>ki</mi> </msub> <mo>+</mo> <msubsup> <mi>&zeta;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> <mo>-</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>r</mi> </munderover> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mrow> <mo>(</mo> <mi>&epsiv;</mi> <mo>+</mo> <msub> <mi>&zeta;</mi> <mi>ki</mi> </msub> <mo>+</mo> <msub> <mi>y</mi> <mi>ki</mi> </msub> <mo>-</mo> <msub> <mi>w</mi> <mi>i</mi> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>-</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mtd> </mtr> <mtr> <mtd> <mo>-</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>r</mi> </munderover> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mrow> <mo>(</mo> <mi>&epsiv;</mi> <mo>+</mo> <msubsup> <mi>&zeta;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>+</mo> <msub> <mi>y</mi> <mi>ki</mi> </msub> <mo>-</mo> <msub> <mi>w</mi> <mi>i</mi> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>-</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>-</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>r</mi> </munderover> <mrow> <mo>(</mo> <msub> <mi>&eta;</mi> <mi>ki</mi> </msub> <msub> <mi>&zeta;</mi> <mi>ki</mi> </msub> <mo>+</mo> <msubsup> <mi>&eta;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <msubsup> <mi>&zeta;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> </mtd> </mtr> </mtable> </mfenced> <mo>,</mo> </mrow> </math>
where C is a constant with C > 0, ζ_ki ≥ 0 and ζ_ki* ≥ 0 are the slack variables, and ε is the fitting accuracy; maximize the objective function:
<math> <mrow> <mi>W</mi> <mrow> <mo>(</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>,</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> <mo>=</mo> <mo>-</mo> <mi>&epsiv;</mi> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>+</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <msub> <mi>y</mi> <mi>ki</mi> </msub> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>)</mo> </mrow> <mo>-</mo> <mfrac> <mn>1</mn> <mn>2</mn> </mfrac> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>,</mo> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>)</mo> </mrow> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ji</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>ji</mi> </msub> <mo>)</mo> </mrow> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mi>ji</mi> </msub> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math>
where the Lagrange factors α_ki and α_ki* satisfy <math> <mrow> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>-</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> <mo>=</mo> <mn>0,0</mn> <mo>&le;</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>,</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>&le;</mo> <mi>C</mi> <mo>,</mo> <mi>k</mi> <mo>=</mo> <mn>1</mn> <mo>,</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mi>K</mi> <mo>,</mo> </mrow> </math> This forms a typical quadratic programming problem. Using the KKT conditions α_ki = 0 → y_ki·f(x_ki) ≥ 1; 0 < α_ki < C → y_ki·f(x_ki) = 1; α_ki = C → y_ki·f(x_ki) ≤ 1, the quadratic programming problem is solved with the sequential minimal optimization algorithm in the following steps:
1) Give the Lagrange factors initial values, typically α_ki = 0;
2) Check the KKT conditions over the training data and find a data point (x_1i, y_1i) that violates them; its Lagrange factor α_1i is taken as one of the two Lagrange factors to be optimized;
3) Find in the training data the data point (x_2i, y_2i) that satisfies max |f_i(x_1i) − f_i(x_2i) + y_2i − y_1i|; its Lagrange factor is taken as α_2i. After α_1i and α_2i are selected, the other Lagrange factors are held unchanged, forming a minimum-scale quadratic programming problem, i.e., solving for the optimal α_1i^new and α_2i^new;
4) solving the minimum quadratic programming problem:
K_11 = x_1i², K_22 = x_2i², K_12 = x_1i·x_2i,
<math> <mrow> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>=</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>-</mo> <mfrac> <mrow> <msub> <mi>y</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> </msub> <mrow> <mo>(</mo> <msub> <mi>E</mi> <mn>1</mn> </msub> <mo>-</mo> <msub> <mi>E</mi> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <mn>2</mn> <msub> <mi>K</mi> <mn>12</mn> </msub> <mo>-</mo> <msub> <mi>K</mi> <mn>11</mn> </msub> <mo>-</mo> <msub> <mi>K</mi> <mn>22</mn> </msub> </mrow> </mfrac> <mo>,</mo> </mrow> </math>
where E_ki = f_i^old(x_ki) − y_ki is the training error;
When y_1i ≠ y_2i, <math> <mrow> <mi>L</mi> <mo>=</mo> <mi>max</mi> <mrow> <mo>(</mo> <mn>0</mn> <mo>,</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>-</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> <mi>H</mi> <mo>=</mo> <mi>min</mi> <mrow> <mo>(</mo> <mi>C</mi> <mo>,</mo> <mi>C</mi> <mo>+</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>-</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math>
When y_1i = y_2i, <math> <mrow> <mi>L</mi> <mo>=</mo> <mi>max</mi> <mrow> <mo>(</mo> <mn>0</mn> <mo>,</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>+</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>-</mo> <mi>C</mi> <mo>)</mo> </mrow> <mo>,</mo> <mi>H</mi> <mo>=</mo> <mi>min</mi> <mrow> <mo>(</mo> <mi>C</mi> <mo>,</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>+</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math>
<math> <mrow> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mrow> <mi>new</mi> <mo>,</mo> <mi>clipped</mi> </mrow> </msubsup> <mo>=</mo> <mfenced open='{' close=''> <mtable> <mtr> <mtd> <mi>H</mi> <mo>,</mo> </mtd> <mtd> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>&GreaterEqual;</mo> <mi>H</mi> </mtd> </mtr> <mtr> <mtd> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>,</mo> </mtd> <mtd> <mi>L</mi> <mo>&lt;</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>&lt;</mo> <mi>H</mi> </mtd> </mtr> <mtr> <mtd> <mi>L</mi> <mo>,</mo> </mtd> <mtd> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>&le;</mo> <mi>L</mi> </mtd> </mtr> </mtable> </mfenced> <mo>,</mo> </mrow> </math>
<math> <mrow> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>new</mi> </msubsup> <mo>=</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>+</mo> <msub> <mi>y</mi> <mrow> <mn>1</mn> <mi>i</mi> </mrow> </msub> <msub> <mi>y</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mi>old</mi> </msubsup> <mo>-</mo> <msubsup> <mi>&alpha;</mi> <mrow> <mn>2</mn> <mi>i</mi> </mrow> <mrow> <mi>new</mi> <mo>,</mo> <mi>clipped</mi> </mrow> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math>
A pair of new Lagrange factors α_1i^new and α_2i^{new,clipped} is obtained;
5) checking whether a data point violating the KKT condition exists, and if yes, returning to the step 2); otherwise, obtaining the optimal solution of the whole problem and carrying out the next step;
6) obtaining a regression function:
<math> <mrow> <msub> <mi>f</mi> <mi>i</mi> </msub> <mrow> <mo>(</mo> <mi>x</mi> <mo>)</mo> </mrow> <mo>=</mo> <msup> <msub> <mi>w</mi> <mi>i</mi> </msub> <mi>T</mi> </msup> <mi>x</mi> <mo>+</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mi>ki</mi> <mo>*</mo> </msubsup> <mo>-</mo> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mo>)</mo> </mrow> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mi>x</mi> <mo>)</mo> </mrow> <mo>+</mo> <msubsup> <mi>b</mi> <mi>i</mi> <mo>*</mo> </msubsup> <mo>,</mo> </mrow> </math>
wherein: <math> <mrow> <msubsup> <mi>b</mi> <mi>i</mi> <mo>*</mo> </msubsup> <mo>=</mo> <mfrac> <mn>1</mn> <mi>N</mi> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>j</mi> <mo>&Element;</mo> <mo>{</mo> <mi>j</mi> <mo>|</mo> <msub> <mi>&alpha;</mi> <mi>ji</mi> </msub> <mo>></mo> <mn>0</mn> <mo>}</mo> </mrow> </munder> <mo>[</mo> <msub> <mi>y</mi> <mi>ji</mi> </msub> <mo>-</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>K</mi> </munderover> <msub> <mi>y</mi> <mi>ki</mi> </msub> <msub> <mi>&alpha;</mi> <mi>ki</mi> </msub> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>ki</mi> </msub> <mo>&CenterDot;</mo> <msub> <mi>x</mi> <mi>ji</mi> </msub> <mo>)</mo> </mrow> <mo>]</mo> <mo>,</mo> </mrow> </math> n is the number of support vectors;
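The pair update of step 4) above, including the clipping of α_2i to [L, H] and the compensating move of α_1i, can be sketched as follows (an illustrative sketch of one SMO pair update in classification-style form; names are mine):

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    """One SMO pair update: take the unconstrained step for alpha2 along
    eta = 2*K12 - K11 - K22, clip to [L, H], then move alpha1 to keep the
    equality constraint sum(y_k * alpha_k) unchanged."""
    eta = 2 * K12 - K11 - K22          # <= 0 for a valid kernel
    a2_new = a2 - y2 * (E1 - E2) / eta
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    a2_clipped = min(H, max(L, a2_new))  # the "new, clipped" factor
    a1_new = a1 + y1 * y2 * (a2 - a2_clipped)
    return a1_new, a2_clipped
```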
At transcoding time, the mapping models established by the support vector regression algorithm are used to map the received group of line spectrum pair coefficients to the group of line spectrum pair coefficients required by the B encoder, which serve as the unquantized line spectrum pair coefficients of the B coding standard;
The obtained unquantized line spectrum pair coefficients are converted into line spectrum frequency coefficients, quantized and encoded according to the B coding standard, and then sent to the communication network 2;
According to the B coding standard of the communication network 2, the group of line spectrum pair coefficients obtained by mapping is interpolated with the line spectrum pair coefficients of the previous frame or frames to obtain the unquantized line spectrum pair coefficients of each subframe, from which the unquantized linear prediction coefficients A(z) of each subframe are calculated. The group of mapped line spectrum pair coefficients is quantized according to the B coding standard to obtain the quantized line spectrum pair coefficients of the current frame, which are interpolated with the quantized line spectrum pair coefficients of the previous frame or frames to obtain the quantized line spectrum pair coefficients of each subframe, from which the quantized linear prediction coefficients A'(z) of each subframe are calculated. The unquantized linear prediction coefficients A(z) and the quantized linear prediction coefficients A'(z) are used to calculate the coefficients of the perceptual weighting filter W(z) = A(z/γ_1)/A(z/γ_2) and the synthesis filter 1/A'(z) respectively, where γ_1 and γ_2 are perceptual weighting factors;
(b) open-loop pitch search:
in speech coding algorithms based on code-excited linear prediction, the pitch search is done in two steps. The first step is an open-loop pitch search that roughly estimates the pitch period, denoted T_op; its purpose is to provide a coarse range for the closed-loop pitch search and thereby reduce its computational load. The second step is a closed-loop pitch search near T_op.
When transcoding is performed, the normal open-loop pitch search is omitted, and the integer part T0 of the decoded pitch delay is used as the open-loop pitch search result T_op for encoding under the B coding standard:
T_op^B = T0^A,
(c) the impulse response of the perceptual weighted synthesis filter and the target signal of the adaptive codebook search are calculated,
the perceptually weighted synthesis filter is H(z) = A(z/γ1)/(A'(z)A(z/γ2)); its impulse response h(n) is needed for the adaptive and fixed codebook searches and is typically computed once per subframe. A unit impulse is passed through the filter A(z/γ1), and then successively through 1/A'(z) and 1/A(z/γ2) to obtain h(n);
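The filtering cascade just described can be sketched directly. A minimal illustration with hypothetical 3rd-order stand-in coefficients (real codecs use 10th-order filters and standard-specific γ values):

```python
import numpy as np

def fir(b, x):
    """All-zero filter A(z): y(n) = sum_i b[i] * x(n - i)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(min(n + 1, len(b))):
            y[n] += b[i] * x[n - i]
    return y

def all_pole(a, x):
    """All-pole filter 1/A(z) with a[0] = 1: y(n) = x(n) - sum_{i>=1} a[i] * y(n-i)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, min(n + 1, len(a))):
            acc -= a[i] * y[n - i]
        y[n] = acc
    return y

def weighted(a, gamma):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma**i (bandwidth expansion)."""
    return a * gamma ** np.arange(len(a))

# Hypothetical LP coefficients with the convention A(z) = 1 + sum a_i z^-i
a_unq = np.array([1.0, -1.3, 0.6, -0.1])   # stand-in unquantized A(z)
a_q   = np.array([1.0, -1.25, 0.55, -0.1]) # stand-in quantized A'(z)
g1, g2 = 0.92, 0.6                         # illustrative weighting factors
impulse = np.zeros(40); impulse[0] = 1.0   # one subframe of 40 samples

# impulse -> A(z/g1) -> 1/A'(z) -> 1/A(z/g2), as in the text
h = all_pole(weighted(a_unq, g2), all_pole(a_q, fir(weighted(a_unq, g1), impulse)))
```

Since every filter in the cascade has leading coefficient 1, the first sample of h(n) is always 1.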
the calculation process of the target signal x (n) of the adaptive codebook search is as follows: first, the residual signal res of the linear prediction filter is calculatedLP(n) the calculation formula is:
res_{LP}(n) = s'(n) + \sum_{i=1}^{P} \hat{a}_i \, s'(n-i),
where s'(n) is the reconstructed speech resulting from the decoding, \hat{a}_i are the quantized linear prediction coefficients, and P is the order of the linear prediction filter. The residual signal res_{LP}(n) is then passed through the perceptually weighted synthesis filter H(z), i.e. res_{LP}(n) is convolved with h(n) to obtain the target signal x(n):
x(n)=resLP(n)*h(n);
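The two formulas above transcribe directly into code. A minimal sketch with stand-in reconstructed speech, hypothetical quantized LP coefficients, and a stand-in impulse response:

```python
import numpy as np

def lp_residual(s, a_hat):
    """res(n) = s(n) + sum_{i=1}^{P} a_hat[i] * s(n-i), with A(z) = 1 + sum a_i z^-i."""
    P = len(a_hat) - 1
    res = np.zeros(len(s))
    for n in range(len(s)):
        acc = s[n]
        for i in range(1, P + 1):
            if n - i >= 0:          # samples before the subframe taken as zero here
                acc += a_hat[i] * s[n - i]
        res[n] = acc
    return res

def target_signal(res, h):
    """x(n) = (res * h)(n), truncated to the subframe length."""
    return np.convolve(res, h)[:len(res)]

rng = np.random.default_rng(1)
s_rec = rng.standard_normal(40)            # stand-in reconstructed speech s'(n)
a_hat = np.array([1.0, -0.9, 0.2])         # hypothetical quantized LP coefficients
h = np.exp(-0.3 * np.arange(40))           # stand-in impulse response, h[0] = 1
x = target_signal(lp_residual(s_rec, a_hat), h)
```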
(d) the adaptive codebook search is performed in a manner such that,
the adaptive codebook search includes closed loop pitch search and calculation of an adaptive codebook vector;
the criterion for the closed-loop pitch search is to minimize the mean square error between the reconstructed speech at the decoding end and the reconstructed speech at the encoding end, that is, to maximize R(k):
R(k) = \frac{\sum_{n=0}^{len-1} x(n) \, y_k(n)}{\sqrt{\sum_{n=0}^{len-1} y_k(n) \, y_k(n)}},
wherein x(n) is the target signal, y_k(n) is the past filtered excitation at delay k, i.e. the convolution of the past excitation with h(n), and len is the subframe length;
when the closed-loop pitch search is performed, the search range is limited to be around a preselected value T _ op, and the range of the closed-loop pitch search is determined according to the value of the integer pitch delay T0 obtained by decoding:
[T0-g1(T0),T0+g2(T0)],
wherein g1 and g2 are functions of T0;
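A sketch of the restricted integer closed-loop search that maximizes R(k) over [T0−g1, T0+g2]. Here g1 and g2 are fixed small constants for illustration (the patent makes them functions of T0), and the excitation and impulse-response data are synthetic:

```python
import numpy as np

def filtered_past_excitation(u_past, h, k, length):
    """y_k(n): past excitation at delay k, convolved with h(n) over one subframe."""
    seg = u_past[len(u_past) - k : len(u_past) - k + length]
    if len(seg) < length:                 # delays shorter than the subframe: repeat
        seg = np.resize(seg, length)
    return np.convolve(seg, h)[:length]

def closed_loop_pitch(x, u_past, h, T0, g1=3, g2=3):
    """Maximize R(k) = <x, y_k> / sqrt(<y_k, y_k>) for k in [T0 - g1, T0 + g2]."""
    best_k, best_R = None, -np.inf
    for k in range(max(T0 - g1, 20), T0 + g2 + 1):   # 20: illustrative lower bound
        y = filtered_past_excitation(u_past, h, k, len(x))
        R = np.dot(x, y) / (np.sqrt(np.dot(y, y)) + 1e-12)
        if R > best_R:
            best_R, best_k = R, k
    return best_k

rng = np.random.default_rng(2)
u_past = rng.standard_normal(200)                 # stand-in past excitation u(n)
h = np.exp(-0.5 * np.arange(40))                  # stand-in impulse response
x = filtered_past_excitation(u_past, h, 57, 40)   # make delay 57 the true optimum
k_best = closed_loop_pitch(x, u_past, h, T0=57)
```

Because the target here is exactly y_57, the normalized correlation peaks at k = 57 by the Cauchy-Schwarz inequality; restricting k to a few values around T0 is what makes the transcoder cheaper than a full encode.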
carrying out the closed-loop pitch search within the limited range to obtain the optimal integer pitch delay k; if, according to the coding standard of the receiving end, k lies in the range where fractional delays are used, the fractions around the optimal integer delay are tested: the normalized correlation R(k) is interpolated and its maximum is searched to obtain the fractional pitch period;
R(k)_t = \sum_{i=0}^{\epsilon} R(k-i) \, b_m(t + i \cdot \epsilon) + \sum_{i=0}^{\epsilon} R(k+1+i) \, b_m(\epsilon - t + i \cdot \epsilon),
where t = 0, 1, …, ε−1, ε is the reciprocal of the fractional delay resolution, and b_m is the interpolation filter. After the pitch lag is determined, the past excitation u(n) is interpolated at the given integer delay k and fraction t to compute the adaptive codebook vector:
v(n) = \sum_{i=0}^{P} u(n-k+i) \, b_q(t + i \cdot \epsilon) + \sum_{i=0}^{P} u(n-k+1+i) \, b_q(\epsilon - t + i \cdot \epsilon),
after the adaptive codebook vector is determined, the adaptive codebook gain g_p may be calculated:
g_p = \frac{\sum_{n=0}^{len-1} x(n) \, y(n)}{\sum_{n=0}^{len-1} y(n) \, y(n)},
where len is the subframe length, x(n) is the target signal of the adaptive codebook search, and y(n) = v(n) * h(n) is the filtered adaptive codebook vector, i.e. the convolution of v(n) with the impulse response h(n) of the perceptually weighted synthesis filter H(z);
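The gain formula above is the least-squares projection of the target onto the filtered adaptive codebook vector; a minimal numeric sketch (toy vectors, not codec data):

```python
import numpy as np

def adaptive_gain(x, y):
    """g_p = <x, y> / <y, y>; codecs typically also clamp this, e.g. to [0, 1.2]."""
    return float(np.dot(x, y) / (np.dot(y, y) + 1e-12))

# Toy example where the target is exactly twice the filtered codebook vector,
# so the projection gain recovers the factor 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 1.0, 1.5, 2.0])
gp = adaptive_gain(x, y)
```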
(e) fixed codebook search
The fixed codebook vector can be expressed as:
c(n) = S_1 \delta(n - m_1) + S_2 \delta(n - m_2) + \ldots + S_{N_p} \delta(n - m_{N_p}), \quad n = 0, 1, \ldots, len-1;
wherein δ(n) is a unit pulse, N_p is the number of non-zero pulses in the fixed codebook vector, and len is the subframe length. m_1, m_2, …, m_{N_p} indicate the positions of the non-zero pulses, and S_1, S_2, …, S_{N_p} the signs (+1 or −1) of the pulses at the corresponding positions; c(n) is a len-dimensional vector in which all elements other than the N_p non-zero pulses are 0;
the fixed codebook search uses the criterion of minimizing the weighted mean square error between the weighted reconstructed speech s'_w(n) of the decoding side and the reconstructed speech of the encoding end to search the fixed codebook vector, i.e. to determine the positions and signs of the non-zero pulses in the codebook vector;
the fixed codebook search first calculates the target signal:
x_2(n) = x(n) - g_p \, y(n), \quad n = 0, 1, \ldots, len-1;
wherein x(n) is the target signal of the adaptive codebook search, y(n) = v(n) * h(n) is the filtered adaptive codebook vector, g_p is the adaptive codebook gain, and len is the subframe length;
if c is a codebook vector, then the codebook vector that maximizes the following quantity is sought:

Q = \frac{(d^T c)^2}{c^T \Phi \, c}, \quad (1)

wherein d is the correlation between the target signal x_2(n) and the impulse response h(n) of the perceptually weighted synthesis filter, \Phi = H^T H is the autocorrelation matrix of h(n), and T denotes matrix transposition;
the elements of vector d are calculated as follows:
d(n) = \sum_{i=n}^{len-1} x_2(i) \, h(i-n), \quad n = 0, 1, \ldots, len-1,
where len is the subframe length. The elements of the symmetric matrix \Phi are calculated as follows:

\phi(i, j) = \sum_{n=\max(i,j)}^{len-1} h(n-i) \, h(n-j), \quad i, j = 0, 1, \ldots, len-1;
the numerator term in formula (1) can be expressed as:
C = \sum_{i=0}^{N_p - 1} S_i \, d(u_i),
wherein u_i is the position of the i-th pulse, S_i is the sign of the i-th pulse, and N_p is the number of non-zero pulses in the fixed codebook vector; the denominator in formula (1) is given by:
E_D = \sum_{i=0}^{N_p - 1} \phi(u_i, u_i) + 2 \sum_{i=0}^{N_p - 2} \sum_{j=i+1}^{N_p - 1} S_i S_j \, \phi(u_i, u_j),
the positions u_1, u_2, …, u_{N_p} that maximize formula (1) are the computed non-zero pulse positions;
at transcoding time, the non-zero pulse positions m_1, m_2, …, m_{N_p} of the fixed codebook vector obtained by decoding are used, according to the coding standard of the receiving end, to limit the search range of the fixed codebook search, so that a simplified search is performed near the positions m_1, m_2, …, m_{N_p};
the fixed codebook gain is searched with the criterion of minimizing the weighted mean square error between the reconstructed speech decoded by the A decoder and the reconstructed speech encoded by the B encoder, that is, minimizing:
E = \|x - g_p y - g_c z\|^2 = x^T x + g_p^2 \, y^T y + g_c^2 \, z^T z - 2 g_p \, x^T y - 2 g_c \, x^T z + 2 g_p g_c \, y^T z,
where x is the target vector of the fixed codebook search, y is the adaptive codebook vector filtered signal, and z is the convolution of the fixed codebook vector with h (n):
z(n) = \sum_{i=0}^{n} c(i) \, h(n-i), \quad n = 0, \ldots, len-1,
wherein len is the subframe length;
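The simplified search of step (e) can be sketched as an exhaustive scan over a small neighborhood of the decoded pulse positions, maximizing C²/E_D from the formulas above. This is a toy illustration: the pulse count, neighborhood radius, and signal data are hypothetical, and pulse signs are taken from the sign of d rather than jointly searched:

```python
import numpy as np
from itertools import product

def corr_vector(x2, h):
    """d(n) = sum_{i=n}^{len-1} x2(i) h(i-n)."""
    L = len(x2)
    return np.array([sum(x2[i] * h[i - n] for i in range(n, L)) for n in range(L)])

def autocorr_matrix(h):
    """phi(i, j) = sum_{n=max(i,j)}^{len-1} h(n-i) h(n-j)."""
    L = len(h)
    phi = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            phi[i, j] = sum(h[n - i] * h[n - j] for n in range(max(i, j), L))
    return phi

def simplified_search(d, phi, decoded_pos, radius=2):
    """Search each pulse only near its decoded position, maximizing C^2 / E_D."""
    L = len(d)
    best, best_score = None, -np.inf
    cands = [range(max(0, p - radius), min(L, p + radius + 1)) for p in decoded_pos]
    for pos in product(*cands):
        if len(set(pos)) < len(pos):
            continue                        # pulses must occupy distinct positions
        s = np.sign(d[list(pos)]); s[s == 0] = 1
        C = float(np.dot(s, d[list(pos)]))
        E = float(s @ phi[np.ix_(pos, pos)] @ s)
        score = C * C / (E + 1e-12)
        if score > best_score:
            best_score, best = score, (pos, tuple(s))
    return best

rng = np.random.default_rng(3)
x2 = rng.standard_normal(40)               # stand-in fixed codebook target
h = np.exp(-0.4 * np.arange(40))           # stand-in impulse response
d = corr_vector(x2, h)
phi = autocorr_matrix(h)
positions, signs = simplified_search(d, phi, decoded_pos=(5, 20, 33))
```

The cost saving is in `cands`: a full encode would scan the standard's entire pulse-position grid, whereas here each pulse is tested at only 2·radius+1 candidate positions around the decoded value.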
(2) if the received frame type is a silence insertion description frame, the transcoding process is as follows:
(a) interpolating the decoded quantized line spectrum pair coefficients of the current frame and of the last silence insertion description frame to serve as the unquantized line spectrum pair coefficients of the corresponding frame at the encoding end of the B coding standard:
LSP_B^{(1)}[i] = \alpha \, LSP_A^{(1)}[i] + (1 - \alpha) \, LSP_A^{(0)}[i], \quad i = 0, 1, \ldots, n,
wherein LSP^{(1)} and LSP^{(0)} respectively denote the line spectrum pair coefficients of the current frame and of the last silence insertion description frame, n is the dimension of the line spectrum pair coefficients, and α is the interpolation coefficient;
(b) converting the decoded energy parameter ener into an energy parameter of a corresponding frame at a coding end of the B coding standard:
ener_B = a \cdot ener_A + b,
wherein a and b are linear fitting coefficients.
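Both silence-insertion-description operations are elementwise; a minimal sketch with hypothetical interpolation and fitting coefficients (α, a, and b are codec- and corpus-dependent and are not specified numerically in the text):

```python
import numpy as np

def sid_lsp(lsp_cur, lsp_prev, alpha=0.5):
    """LSP_B = alpha * LSP_A(current) + (1 - alpha) * LSP_A(last SID frame)."""
    return alpha * np.asarray(lsp_cur) + (1.0 - alpha) * np.asarray(lsp_prev)

def sid_energy(ener_a, a=1.0, b=0.0):
    """ener_B = a * ener_A + b, with (a, b) fitted offline by linear regression."""
    return a * ener_a + b

lsp1 = np.linspace(0.1, 3.0, 10)           # stand-in current-frame LSPs
lsp0 = np.linspace(0.2, 3.1, 10)           # stand-in previous-SID-frame LSPs
lsp_b = sid_lsp(lsp1, lsp0, alpha=0.5)
ener_b = sid_energy(30.0, a=0.9, b=1.5)    # hypothetical fitted coefficients
```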
5. The method of claim 1, wherein the transcoding method comprises: the coding unit is used for carrying out quantization coding on the obtained parameters, and the specific steps are as follows:
(1) if the current frame is a voice frame, the parameters comprise the line spectrum pair coefficients, the pitch delay, the positions and signs of the non-zero pulses of the fixed codebook, the adaptive codebook gain and the fixed codebook gain; each parameter is quantized and coded according to the coding standard of the communication network 2 to obtain the parameter bits;
(2) if the current frame is a silence insertion description frame, the parameters are line spectrum pair coefficients and voice energy, and the parameters are quantized and coded according to the coding standard of the communication network 2 to obtain parameter bits.
6. The method of claim 1, wherein the transcoding method comprises: the bit stream encapsulation unit is used for packing and outputting the parameter bits, the mode information and the frame type, wherein the output frame type is assigned according to the received frame type, so that the input and output frame types are the same: if the received data frame is a voice frame, the output frame type is also a voice frame; if the received data frame is a mute frame, the output frame type is also a mute frame; the frame type is not judged from the reconstructed speech.
CN201310598532.5A 2013-11-20 2013-11-20 Transcoding method for code stream of voice coder Pending CN104658539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310598532.5A CN104658539A (en) 2013-11-20 2013-11-20 Transcoding method for code stream of voice coder


Publications (1)

Publication Number Publication Date
CN104658539A true CN104658539A (en) 2015-05-27

Family

ID=53249579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310598532.5A Pending CN104658539A (en) 2013-11-20 2013-11-20 Transcoding method for code stream of voice coder

Country Status (1)

Country Link
CN (1) CN104658539A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022179406A1 (en) * 2021-02-26 2022-09-01 腾讯科技(深圳)有限公司 Audio transcoding method and apparatus, audio transcoder, device, and storage medium



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150527
