CN102664021B

CN102664021B - Low-rate speech coding method based on speech power spectrum

Info

Publication number: CN102664021B
Application number: CN2012101195671A
Authority: CN
Inventors: 汤一彬; 张德国; 李枭雄; 单鸣雷; 朱昌平; 韩庆邦; 高远; 殷澄
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2012-04-20
Filing date: 2012-04-20
Publication date: 2013-10-02
Anticipated expiration: 2032-04-20
Also published as: CN102664021A

Abstract

The invention discloses a low-rate speech coding method based on the speech power spectrum, and in particular a speech processing technique based on dictionary-learning sparse representation and reconstruction of signals. An efficient speech model whose main output parameter is the speech power spectrum serves as the low-rate speech coding model. At the transmitter, the speech signal is processed to yield the speech power spectrum; this parameter is then compressed using sparse theory and finally converted into a bit stream for wireless transmission. Dictionary learning is performed at the receiver, which makes low-rate speech communication possible, and it exploits as much information as possible from the previous frame of synthesized speech. Sparse coefficients are matched to dictionary atoms on the basis of energy, and a measurement matrix is constructed to improve matching accuracy, so that the speech power spectrum is optimally recovered at the receiver.

Description

Low-rate speech coding method based on the speech power spectrum

Technical field

The present invention relates to a low-rate speech coding method, and in particular to a speech processing technique based on dictionary-learning sparse representation and reconstruction of signals.

Background art

In recent years, with the development of data compression techniques, low-rate speech coding has also advanced rapidly and is widely used in mobile, secure, and underwater communication. Although CELP waveform coders can still produce high-quality speech at a bit rate of 4.8 kb/s, coder performance deteriorates sharply once the bit rate drops below 4 kb/s. In that regime the speech signal must be compressed efficiently, and the compression must suit the requirements of the communication link.

The paper Boucheron, Laura E.; De Leon, Phillip L.; Sandoval, Steven. Low bit-rate speech coding through quantization of mel-frequency cepstral coefficients [J]. IEEE Transactions on Audio, Speech and Language Processing, 2012, 20(2): 610-619 shows that directly quantizing and coding the mel-frequency cepstral coefficients extracted from the speech power spectrum can effectively improve speech coding quality. At a coding rate of 1.2 kbps, this method gains about 0.1 in the PESQ evaluation over enhanced mixed-excitation linear prediction (MELPe) coding of the same class, and at the lower rate of 0.6 kbps the gain is larger, about 0.25. The coding of mel-frequency cepstral coefficients in that paper is in essence a data compression from a data set in a large space (the speech power spectrum domain) to a data set in a small space (the cepstral domain).

For the efficient representation of large data sets by small data sets, sparse signal representation and reconstruction theory is an emerging means of signal representation that has appeared in recent years and can be applied in fields such as data mining, pattern classification, and compressed sensing. The theory does not require exact recovery of the original data. Instead, according to some criterion, it seeks the smallest number of sparse coefficients in the space of a basis set (the dictionary) that approximates the original data as closely as possible, thereby reconstructing the data.

Summary of the invention

The object of the invention is to apply dictionary-learning sparse representation and reconstruction theory to speech signal processing, and to construct a low-rate speech compression coding system based on that theory for speech communication.

The technical scheme of the invention rests on two considerations. 1. Sparse theory: a speech signal has inherent structure, such as harmonic structure. If the dictionary can capture this structure through learning, the sparse coefficients obtained over that dictionary represent the signal's features as fully as possible, which greatly reduces the redundancy of the output coefficients and achieves efficient data compression. 2. Communication: to reduce the coding rate, the transmitter and the receiver must hold identical dictionaries, so that only the sparse coefficients are transmitted. The transmitter decomposes the original data over the dictionary to obtain the sparse coefficients; the receiver recovers the original data from the received sparse coefficients using the agreed dictionary, which completes the communication. The invention combines the two: dictionary learning at the receiver achieves data compression, and an analysis-by-synthesis method at the transmitter produces a dictionary identical to the receiver's, which satisfies the communication requirement. The result is a low-rate speech coding method with receiver-side dictionary learning.

The invention takes the speech power spectrum as the main parameter and compresses it using dictionary-learning sparse representation and reconstruction theory. This yields a high-quality vocoder design at 1.2 kbps that achieves better synthesized speech quality than vocoders of the same class at the same coding rate.

A low-rate speech coding method based on the speech power spectrum, characterized by comprising the following steps:

(1) transmitter encoding: the speech signal passes through a speech model that outputs parameters; the output parameters undergo data processing to produce sparse coefficients and are converted into a bit stream;

(2) receiver decoding: the received parameters undergo data processing to recover the relevant parameters, and the final synthesized speech is obtained through a speech synthesis model based on the speech power spectrum.

The speech model at the transmitter outputs three parameters: the pitch, the normalized speech power spectrum, and the energy gain.

The transmitter converts three parameters into a bit stream for transmission: the pitch, the energy gain, and the sparse coefficients produced from the normalized speech power spectrum by sparse theory.

Data processing at the transmitter is implemented by several modules: a sparse decomposition module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries.

The receiver receives three parameters: the pitch, the energy gain, and the sparse coefficients.

Data processing at the receiver is implemented by several modules: a sparse reconstruction module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries.

The transmitter's dictionary learning module runs a dictionary learning algorithm identical to the receiver's and produces the current frame's dictionary only from the sparse coefficients and dictionaries of the preceding several frames.

The receiver's dictionary learning module builds the current frame's dictionary from the sparse coefficients and dictionaries of the preceding several frames. Learning this dictionary is split into learning in two spaces: the space corresponding to the signals received in the preceding frames, and its complementary space.

Dictionary learning at the receiver is what makes low-rate speech communication possible, and it exploits as much information as possible from the previous frames of synthesized speech.

The space corresponding to the signals received in the preceding frames is learned as follows. At time n, the dictionary subspace of the previously received frames is learned by

\min_{\hat{D}_{1},\,\hat{a}_{i}} \sum_{i < n} \| D_{i} a_{i} - \hat{D}_{1} \hat{a}_{i} \|_{2}^{2} + λ_{1} \| \hat{a}_{i} \|_{1}

where \|\cdot\|_{p} denotes the p-norm, D_{i} and a_{i} are the dictionary and sparse coefficients of previous frame i at the receiver, \hat{D}_{1} and \hat{a}_{i} are the learned dictionary and its corresponding sparse coefficients, and λ_{1} is a Lagrangian coefficient.

The complementary space is learned using the average speech power spectrum as auxiliary information, which yields the dictionary of the complementary space:

\min_{\hat{D}_{2},\,\hat{b}_{i}} \sum_{i \leq K} \| \bar{P}_{i} - [\hat{D}_{1} \ \hat{D}_{2}] \hat{b}_{i} \|_{2}^{2} + λ_{2} \| \hat{b}_{i} \|_{1}

where \hat{D}_{2} and \hat{b}_{i} are the dictionary learned for the complementary space and its corresponding sparse coefficients. For each pitch period, the speech power spectra are trained into K classes, and \bar{P}_{i} denotes the average speech power spectrum of class i. λ_{2} is a Lagrangian coefficient.

In the sparse reconstruction module, a sparse reconstruction algorithm rebuilds the data by matching dictionary atoms. The data are obtained by multiplying the dictionary D_{n} at time n by the corresponding sparse coefficients a_{n}, while the positions of the nonzero entries of a_{n} at the receiver are obtained by matching on energy. The normalized speech power spectrum is recovered by

\min_{a_{n}} \| \, \| D_{n} a_{n} \|_{2}^{2} - 1 \, \|_{2}

where D_{n} and a_{n} are the dictionary and corresponding sparse coefficients at time n. The invention uses this formula to perform the matching on the basis of energy.

The dictionary D_{n} at time n is constructed as

D_{n} = [\hat{D}_{1} \ \hat{D}_{2}] \times φ

where φ is a random matrix generated at both the receiver and the transmitter from the same agreed seed. Each atom of D_{n} (that is, each column of the matrix D_{n}) is then energy-normalized.

Beneficial effects of the invention:

The invention uses an efficient speech model whose main output parameter is the speech power spectrum as the low-rate speech coding model. At the transmitter, the speech signal is processed to yield the speech power spectrum; this parameter is then compressed using sparse theory and finally converted into a bit stream for wireless transmission. Dictionary learning at the receiver makes low-rate speech communication possible and exploits as much information as possible from the previous frames of synthesized speech. Sparse coefficients are matched to dictionary atoms on the basis of energy, and a measurement matrix is constructed to improve matching accuracy, so that the speech power spectrum is optimally recovered at the receiver.

Description of drawings

Fig. 1 is the framework diagram of the low-rate speech coding of the invention;

Fig. 2A is the transmitter encoding block diagram of the low-rate speech codec scheme of the invention based on sparse signal representation and reconstruction theory;

Fig. 2B is the receiver decoding block diagram of the low-rate speech codec scheme of the invention based on sparse signal representation and reconstruction theory;

Fig. 3A is the framework diagram of dictionary learning at the receiver of the invention;

Fig. 3B is the detailed implementation diagram of dictionary learning at the receiver of the invention.

Embodiment

The low-rate speech coding and decoding method of the invention is described in further detail below with reference to the accompanying drawings.

The low-rate speech coding framework designed in the invention is shown in Fig. 1. The speech signal first passes through a speech model, which outputs the model parameters; these parameters are then compressed using sparse theory and finally converted into a bit stream for wireless transmission.

Fig. 2A and Fig. 2B show the structure of the low-rate speech transmitter encoder and receiver decoder of the invention, respectively.

In the transmitter encoding block diagram of Fig. 2A, the speech signal (8 kHz sampling rate) is first split into 25 ms frames. The speech model outputs three parameters: the pitch, the normalized speech power spectrum, and the energy gain. The power spectrum is computed with, for example, a 512-point FFT. To make quantization and the subsequent sparse representation and reconstruction easier, the power spectrum is split into a gain and a normalized power spectrum, which are processed separately. The transmitter passes the pitch, the energy gain, and the sparse coefficients produced from the normalized speech power spectrum by sparse theory through a quantizing encoder, which converts them into a bit stream for transmission.
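The per-frame analysis described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the Hamming window and the choice of the Euclidean norm of the power spectrum as the gain are assumptions (the latter chosen so that the normalized spectrum has unit energy, in line with the unit-energy constraint of formula (4)); the names `analyze_frame`, `FRAME_LEN` and `NFFT` are hypothetical.

```python
import numpy as np

FS = 8000          # sampling rate in Hz (from the text)
FRAME_LEN = 200    # 25 ms at 8 kHz (from the text)
NFFT = 512         # FFT size (from the text)

def analyze_frame(frame):
    """Return (gain, normalized power spectrum) for one 25 ms frame."""
    windowed = frame * np.hamming(len(frame))    # window choice is an assumption
    spectrum = np.fft.rfft(windowed, n=NFFT)     # zero-padded 512-point FFT
    power = np.abs(spectrum) ** 2                # 257 one-sided power-spectrum bins
    gain = np.linalg.norm(power)                 # assumed gain: L2 norm of the power spectrum
    normalized = power / gain if gain > 0 else power
    return gain, normalized

# Example frame: a 200 Hz sine lasting one frame.
t = np.arange(FRAME_LEN) / FS
frame = np.sin(2 * np.pi * 200 * t)
gain, p_norm = analyze_frame(frame)
```

Multiplying `p_norm` by `gain` gives back the unnormalized power spectrum, which is the operation the receiver performs after sparse reconstruction.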

The pitch can be obtained by the autocorrelation method; the pitch period lies in the range of 20 to 147 samples. The normalized power spectrum is processed by the sparse representation and reconstruction part enclosed by the dashed box in the figure. This data processing uses three modules: a sparse decomposition module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries. The invention runs at the transmitter (the signal analysis end) a dictionary learning algorithm identical to that of the receiver (the signal synthesis end), and obtains the sparse coefficients over this dictionary with a suitable algorithm such as basis pursuit. In the quantizing encoder, the pitch is uniformly quantized with 7 bits and the energy gain is uniformly quantized with 6 bits in the log domain. For the sparse coefficients, the 10 coefficients that minimize the decomposition error energy of the normalized power spectrum are selected as the quantizer input. These 10 parameters are first sorted in descending order, then vector-quantized with 17 bits and transmitted; the vector quantizer is designed with the LBG algorithm.
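The pitch search and the bit allocation above can be checked with a short sketch. It assumes a normalized autocorrelation score and an illustrative log-gain range of -4 to 4 (neither is specified in the text); `estimate_pitch` and `quantize_uniform` are hypothetical helper names. Note that the 128 lags from 20 to 147 fit a 7-bit index exactly, and that 7 + 6 + 17 = 30 bits per 25 ms frame gives the stated 1.2 kbps.

```python
import numpy as np

MIN_LAG, MAX_LAG = 20, 147    # pitch period range in samples (from the text)
FRAME_MS = 25                 # frame length in milliseconds (from the text)

def estimate_pitch(frame):
    """Return the lag in [MIN_LAG, MAX_LAG] with the highest normalized autocorrelation."""
    best_lag, best_score = MIN_LAG, -np.inf
    for lag in range(MIN_LAG, MAX_LAG + 1):
        a, b = frame[:-lag], frame[lag:]
        score = np.dot(a, b) / (np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def quantize_uniform(value, lo, hi, bits):
    """Uniform scalar quantizer: return (index, reconstructed value)."""
    step = (hi - lo) / (2 ** bits - 1)
    index = int(round((min(max(value, lo), hi) - lo) / step))
    return index, lo + index * step

# Example frame: 200 samples with an 80-sample period plus a second harmonic.
n = np.arange(200)
frame = np.sin(2 * np.pi * n / 80) + 0.5 * np.sin(4 * np.pi * n / 80)

lag = estimate_pitch(frame)
pitch_index = lag - MIN_LAG                      # 7-bit index in 0..127
gain_index, gain_hat = quantize_uniform(np.log10(3.7), lo=-4.0, hi=4.0, bits=6)

bits_per_frame = 7 + 6 + 17                      # pitch + log gain + sparse-coefficient VQ
bit_rate = bits_per_frame * 1000 // FRAME_MS     # bits per second
```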

In the receiver decoding block diagram of Fig. 2B, the receiver first decodes the three received parameters: the pitch, the energy gain, and the sparse coefficients. The received sparse coefficients are then processed by the sparse representation and reconstruction part enclosed by the dashed box in the figure. This data processing uses three modules: a sparse reconstruction module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries.

The dictionary learning module builds the current frame's dictionary from the buffered previous-frame sparse coefficients and dictionaries. Learning this dictionary is split into learning in two spaces: the space corresponding to the signals received in previous frames, and its complementary space. The current frame's dictionary is generated jointly from the previous frames' sparse coefficients and dictionaries and from auxiliary information (such as the pitch and the trained average power spectra). The normalized power spectrum is then synthesized from the current frame's sparse coefficients and dictionary by the sparse reconstruction algorithm. The power spectrum is obtained by multiplying the recovered normalized power spectrum by the energy gain. Finally, the two parameters, phase information and power spectrum, yield the final synthesized speech through the speech synthesis model based on the speech power spectrum. The phase information is obtained by the least-squares-error inverse short-time Fourier transform (LSE-ISTFT) method, which produces the final synthesized speech at the same time.
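One common reading of LSE-ISTFT is the iterative least-squares estimate of Griffin and Lim, which alternates between a least-squares overlap-add inverse STFT and re-imposing the known magnitude. Whether the patent uses exactly this iteration is an assumption; the 50% frame overlap, the Hamming window, the random initial phase and all function names below are likewise illustrative.

```python
import numpy as np

def stft(x, win, hop, nfft):
    """One-sided STFT with a sliding analysis window."""
    n_frames = 1 + (len(x) - len(win)) // hop
    frames = np.stack([x[i * hop:i * hop + len(win)] * win for i in range(n_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)

def lse_istft(X, win, hop):
    """Least-squares overlap-add inverse STFT."""
    frames = np.fft.irfft(X, axis=1)[:, :len(win)]
    length = (X.shape[0] - 1) * hop + len(win)
    x = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        x[start:start + len(win)] += frame * win
        norm[start:start + len(win)] += win ** 2
    return x / np.maximum(norm, 1e-12)

def recover_signal(magnitude, win, hop, nfft, n_iter=30, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, magnitude.shape)   # random initial phase
    errors = []
    for _ in range(n_iter):
        x = lse_istft(magnitude * np.exp(1j * phase), win, hop)
        X = stft(x, win, hop, nfft)
        errors.append(np.linalg.norm(np.abs(X) - magnitude))
        phase = np.angle(X)                               # keep phase, re-impose magnitude
    return x, errors

# Example: magnitude is the square root of the (here, exactly known) power spectrum.
win, hop, nfft = np.hamming(200), 100, 512
t = np.arange(1600) / 8000
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
magnitude = np.abs(stft(signal, win, hop, nfft))
x_hat, errors = recover_signal(magnitude, win, hop, nfft)
```

In the decoder, `magnitude` would come from the recovered power spectrum rather than from the original signal; the example uses the true magnitude only to keep the sketch self-contained.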

The receiving end dictionary study of Fig. 3 B is concrete implements among the figure the sparse coefficient a of the preceding frame of buffer memory _iWith preceding frame dictionary D _iCarry out dictionary study, obtain the dictionary of learning out

Sparse coefficient with correspondence.In to the study of the dictionary of complementary space, supplementary average speech power spectrum is as training data, dictionary

As a part of dictionary, learn, obtain the dictionary of complementary space

Sparse coefficient with correspondence.Above two parts dictionary is merged, and the dictionary D that multiplies each other and construct this frame with stochastic matrix _nAnd the sparse coefficient of this frame and this frame dictionary can synthesize the normalized power spectrum by sparse restructing algorithm.

The key techniques of the invention are:

1. Dictionary learning based on different subspaces

If dictionary learning were performed on the original signal at the transmitter and only the sparse coefficients were sent, the receiver would have the sparse coefficients but no corresponding dictionary, so it could not reconstruct the data and the communication would fail. The invention therefore performs dictionary learning at the receiver, and the transmitter obtains the receiver's dictionary by an analysis-by-synthesis method, so that both ends hold identical dictionaries at every instant and communication is possible. Dictionary learning at the receiver is shown in Fig. 3A.

For dictionary learning at the receiver, the invention learns the dictionaries corresponding to different subspaces so that the dictionary space covers the whole data space.

Learning the current dictionary at the receiver is split into two spaces: the space corresponding to the signals received in previous frames, and its complementary space.

At time n, learning the dictionary subspace of the previously received frames can in general be expressed as

\min_{\hat{D}_{1},\,\hat{a}_{i}} \sum_{i < n} \| D_{i} a_{i} - \hat{D}_{1} \hat{a}_{i} \|_{2}^{2} + λ_{1} \| \hat{a}_{i} \|_{1} \qquad (1)

In formula (1), \|\cdot\|_{p} denotes the p-norm, D_{i} and a_{i} are the dictionary and sparse coefficients of previous frame i at the receiver, and \hat{D}_{1} and \hat{a}_{i} are the learned dictionary and its corresponding sparse coefficients.

Formula (1) can be solved with algorithms such as K-SVD or the inverse-matrix iteration of recursive least squares. λ_{1} is a Lagrangian coefficient, generally between 0 and 1.
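As a rough illustration of an objective of the form of formula (1), the sketch below alternates one ISTA (iterative soft-thresholding) step on the coefficients with a least-squares dictionary update (the method of optimal directions). This deliberately swaps in ISTA and MOD for the K-SVD or recursive least-squares solvers named in the text, uses a single previous dictionary instead of one per frame, and all names and sizes are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(Y, D, A, lam):
    """Sum over columns of ||y_i - D a_i||_2^2 + lam * ||a_i||_1."""
    return np.sum((Y - D @ A) ** 2) + lam * np.sum(np.abs(A))

def learn_dictionary(Y, n_atoms, lam=0.1, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((n_atoms, Y.shape[1]))
    history = [objective(Y, D, A, lam)]
    for _ in range(n_iter):
        # Sparse coding: one ISTA step on A with D fixed.
        L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
        grad = 2.0 * D.T @ (D @ A - Y)
        A = soft_threshold(A - grad / L, lam / L)
        # Dictionary update: least-squares fit of D with A fixed (MOD).
        D = np.linalg.lstsq(A.T, Y.T, rcond=None)[0].T
        history.append(objective(Y, D, A, lam))
    return D, A, history

# Targets y_i = D_i a_i built from a synthetic previous dictionary and 4-sparse coefficients.
rng = np.random.default_rng(1)
D_prev = rng.standard_normal((32, 48))
D_prev /= np.linalg.norm(D_prev, axis=0)
A_prev = np.zeros((48, 40))
for col in range(40):
    A_prev[rng.choice(48, size=4, replace=False), col] = rng.standard_normal(4)
Y = D_prev @ A_prev

D1_hat, A_hat, history = learn_dictionary(Y, n_atoms=24)
```

Each of the two steps can only lower the objective or leave it unchanged, so `history` is non-increasing; this says nothing about how close the result is to what K-SVD would produce.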

The invention adds auxiliary information to learn the dictionary of the complementary space. The average speech power spectrum is used as the auxiliary information, and learning follows

\min_{\hat{D}_{2},\,\hat{b}_{i}} \sum_{i \leq K} \| \bar{P}_{i} - [\hat{D}_{1} \ \hat{D}_{2}] \hat{b}_{i} \|_{2}^{2} + λ_{2} \| \hat{b}_{i} \|_{1} \qquad (2)

In formula (2), \hat{D}_{2} and \hat{b}_{i} are the dictionary learned for the complementary space and its corresponding sparse coefficients. If \hat{D}_{1} is a matrix with m rows and n columns, the space spanned by its columns necessarily lies within the m-dimensional full space spanned by m orthogonal basis vectors, so \hat{D}_{1} spans only a subspace of the full space. The part of the full space that remains after removing the space occupied by \hat{D}_{1} is called the complementary space of \hat{D}_{1} with respect to the full space. For each pitch period (the pitch period takes integer values from 20 to 147 samples), the invention trains the speech power spectra into K classes (K = 64), and \bar{P}_{i} denotes the average speech power spectrum of class i. The dictionary subspace \hat{D}_{2} can be obtained in the same manner as \hat{D}_{1}. λ_{2} is a Lagrangian coefficient, generally between 0 and 1.
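The notion of a complementary space can be made concrete with a small numerical sketch. It is not the learning procedure of formula (2); it only shows, under the assumption that the complement is taken orthogonally, which part of the class-average spectra is not already covered by the first dictionary and is therefore left for the second dictionary to represent. The matrix sizes other than K = 64 and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n1, K = 16, 5, 64                           # m and n1 are illustrative; K = 64 is from the text
D1 = rng.standard_normal((m, n1))              # stand-in for the learned dictionary D1_hat
P_bar = np.abs(rng.standard_normal((m, K)))    # stand-in for the K class-average power spectra

Q, _ = np.linalg.qr(D1)                        # orthonormal basis of span(D1), shape m x n1
in_span = Q @ (Q.T @ P_bar)                    # component already representable by D1
residual = P_bar - in_span                     # component lying in the complementary space
```

The residual is orthogonal to every atom of `D1` and lives in a space of dimension at most m - n1, which is the room available to the complementary-space dictionary.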

The dictionary D_{n} at time n is constructed as

D_{n} = [\hat{D}_{1} \ \hat{D}_{2}] \times φ \qquad (3)

In formula (3), φ is a random matrix generated at both the receiver and the transmitter from the same agreed seed.
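A minimal sketch of formula (3) follows, assuming a Gaussian random matrix (the text does not state the distribution) and illustrative matrix sizes; `build_frame_dictionary` is a hypothetical name. The column normalization corresponds to the energy normalization of each atom described in the summary. Because both ends seed the generator identically, they obtain bit-identical dictionaries without transmitting φ.

```python
import numpy as np

def build_frame_dictionary(D1, D2, n_atoms, seed):
    """Form D_n = [D1 D2] x phi and normalize every atom (column) to unit energy."""
    rng = np.random.default_rng(seed)                        # same agreed seed at both ends
    merged = np.hstack([D1, D2])                             # [D1_hat  D2_hat]
    phi = rng.standard_normal((merged.shape[1], n_atoms))    # Gaussian phi is an assumption
    Dn = merged @ phi
    return Dn / np.linalg.norm(Dn, axis=0, keepdims=True)

# Stand-ins for the two learned sub-dictionaries.
data_rng = np.random.default_rng(7)
D1 = data_rng.standard_normal((32, 24))
D2 = data_rng.standard_normal((32, 16))

AGREED_SEED = 2012                                           # hypothetical shared seed
Dn_tx = build_frame_dictionary(D1, D2, n_atoms=40, seed=AGREED_SEED)
Dn_rx = build_frame_dictionary(D1, D2, n_atoms=40, seed=AGREED_SEED)
```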

2. Data reconstruction based on dictionary atom matching

In the sparse reconstruction module, a sparse reconstruction algorithm rebuilds the data. Specifically, the following formula is used to recover the normalized speech power spectrum, so that the dictionary atoms are correctly matched with their corresponding sparse coefficients:

\min_{a_{n}} \| \, \| D_{n} a_{n} \|_{2}^{2} - 1 \, \|_{2} \qquad (4)

In formula (4), D_{n} and a_{n} are the dictionary and corresponding sparse coefficients at time n. The invention uses formula (4) to perform the matching on the basis of energy. Because the random matrix φ introduced when constructing D_{n} destroys the dictionary's orthogonality, energy matching is guaranteed to be achievable.
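One way to read formula (4) is as a search over coefficient positions: the receiver knows the sorted coefficient values but not which atoms they belong to, and it picks the placement whose reconstruction has energy closest to 1. The exhaustive search below is an assumption made for illustration on a tiny problem; the text does not specify the solver, and `match_positions` is a hypothetical name. For a scalar argument the outer norm in formula (4) reduces to an absolute value.

```python
import itertools
import numpy as np

def match_positions(Dn, values):
    """Place `values` on atoms of Dn so that ||Dn a||_2^2 is as close to 1 as possible."""
    n_atoms = Dn.shape[1]
    best_err, best_pos = np.inf, None
    for pos in itertools.permutations(range(n_atoms), len(values)):
        a = np.zeros(n_atoms)
        a[list(pos)] = values
        err = abs(np.sum((Dn @ a) ** 2) - 1.0)
        if err < best_err:
            best_err, best_pos = err, pos
    a_hat = np.zeros(n_atoms)
    a_hat[list(best_pos)] = values
    return a_hat, best_err

# Non-orthogonal dictionary with unit-energy atoms.
rng = np.random.default_rng(3)
Dn = rng.standard_normal((12, 8))
Dn /= np.linalg.norm(Dn, axis=0)

# Ground-truth coefficients, scaled so the reconstructed spectrum has unit energy.
a_true = np.zeros(8)
a_true[[1, 5, 6]] = [0.9, 0.5, 0.3]
a_true /= np.linalg.norm(Dn @ a_true)

values = a_true[[1, 5, 6]]                 # received values, sorted in descending order
a_hat, err = match_positions(Dn, values)
```

With an orthonormal dictionary every placement would give the same energy and the search could not discriminate between them, which is consistent with the role the text assigns to φ. The placement found is not guaranteed to be the true one when several placements give nearly the same energy.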

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the technical principle of the invention, and such improvements and modifications shall also fall within the scope of protection of the invention.

Claims

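1. A low-rate speech coding method based on the speech power spectrum, characterized by comprising the following steps:

(1) transmitter encoding: the speech signal passes through a speech model that outputs parameters; the output parameters undergo data processing to produce sparse coefficients and are converted into a bit stream;

(2) receiver decoding: the received parameters undergo data processing to recover the relevant parameters, and the final synthesized speech is obtained through a speech synthesis model based on the speech power spectrum;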

data processing at the transmitter is implemented by several modules, comprising a sparse decomposition module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries;

data processing at the receiver is implemented by several modules, comprising a sparse reconstruction module, a dictionary learning module, and a module that buffers the previous-frame sparse coefficients and dictionaries;

the transmitter's dictionary learning module runs a dictionary learning algorithm identical to the receiver's and produces the current frame's dictionary only from the sparse coefficients and dictionaries of the preceding several frames;

the receiver's dictionary learning module builds the current frame's dictionary from the sparse coefficients and dictionaries of the preceding several frames; learning this dictionary is split into learning in two spaces: the space corresponding to the signals received in the preceding frames, and its complementary space;

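the space corresponding to the signals received in the preceding frames is learned as follows: at time n, the dictionary subspace of the previously received frames is learned by

\min_{\hat{D}_{1},\,\hat{a}_{i}} \sum_{i < n} \| D_{i} a_{i} - \hat{D}_{1} \hat{a}_{i} \|_{2}^{2} + λ_{1} \| \hat{a}_{i} \|_{1}

where \|\cdot\|_{p} denotes the p-norm, D_{i} and a_{i} are the dictionary and sparse coefficients of previous frame i at the receiver, \hat{D}_{1} and \hat{a}_{i} are the learned dictionary and its corresponding sparse coefficients, and λ_{1} is a Lagrangian coefficient;

the complementary space is learned using the average speech power spectrum as auxiliary information, which yields the dictionary of the complementary space:

\min_{\hat{D}_{2},\,\hat{b}_{i}} \sum_{i \leq K} \| \bar{P}_{i} - [\hat{D}_{1} \ \hat{D}_{2}] \hat{b}_{i} \|_{2}^{2} + λ_{2} \| \hat{b}_{i} \|_{1}

where \hat{D}_{2} and \hat{b}_{i} are the dictionary learned for the complementary space and its corresponding sparse coefficients; for each pitch period the speech power spectra are trained into K classes, \bar{P}_{i} denotes the average speech power spectrum of class i, and λ_{2} is a Lagrangian coefficient;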

the dictionary D_{n} at time n is constructed as

D_{n} = [\hat{D}_{1} \ \hat{D}_{2}] \times φ

where φ is a random matrix generated at both the receiver and the transmitter from the same agreed seed;

finally, each atom of the dictionary D_{n}, that is, each column of the matrix D_{n}, is energy-normalized;

in the sparse reconstruction module, a sparse reconstruction algorithm rebuilds the data, which are obtained by multiplying the dictionary D_{n} at time n by the corresponding sparse coefficients a_{n}; the positions of the nonzero entries of a_{n} at the receiver are obtained by matching on energy, and the normalized speech power spectrum is recovered by

\min_{a_{n}} \| \, \| D_{n} a_{n} \|_{2}^{2} - 1 \, \|_{2} .

2. The low-rate speech coding method based on the speech power spectrum according to claim 1, characterized in that the speech model at the transmitter outputs three parameters: the pitch, the normalized speech power spectrum, and the energy gain.

3. The low-rate speech coding method based on the speech power spectrum according to claim 2, characterized in that the transmitter converts three parameters into a bit stream for transmission: the pitch, the energy gain, and the sparse coefficients produced from the normalized speech power spectrum by sparse theory.

4. The low-rate speech coding method based on the speech power spectrum according to claim 1, characterized in that the receiver receives three parameters: the pitch, the energy gain, and the sparse coefficients.