WO2012159370A1 - Voice enhancement method and device - Google Patents

Voice enhancement method and device

Info

Publication number
WO2012159370A1
Authority
WO
WIPO (PCT)
Prior art keywords
linear prediction
coefficient
prediction coefficients
lifting factor
prediction coefficient
Prior art date
Application number
PCT/CN2011/078087
Other languages
French (fr)
Chinese (zh)
Inventor
田薇
李玉龙
邝秀玉
贺知明
Original Assignee
华为技术有限公司
电子科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司, 电子科技大学
Priority to CN201180001446.0A priority Critical patent/CN103038825B/en
Priority to PCT/CN2011/078087 priority patent/WO2012159370A1/en
Publication of WO2012159370A1 publication Critical patent/WO2012159370A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • Embodiments of the present invention relate to the field of communications, and in particular, to a voice enhancement method and apparatus.
  • When the Tandem scheme is used for code-stream conversion, the speech quality is impaired because two lossy compressions are involved, and the objective Mean Opinion Score (MOS) decreases, which affects the intelligibility of the speech.
  • MOS Mean Opinion Score
  • Compared with the former scheme, the Transcoding scheme can greatly reduce the amount of computation. However, because of mismatches such as the rate difference between the two code streams, the voice quality is still impaired after the stream conversion and speech intelligibility declines, that is, the level of recognition of speech decreases.
  • One technical problem to be solved by the present invention is to overcome the shortcoming of the prior art that speech quality degrades while speech intelligibility is improved, and to provide a speech enhancement method with high-frequency compensation that exploits the contribution of the formants and the mid-to-high frequency components of speech to speech intelligibility.
  • a speech enhancement method comprising: acquiring M first linear prediction coefficients of a voiced frame signal, where M is an order of a linear prediction filter;
  • obtaining a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and modifying the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.
  • a voice enhancement device includes: an acquisition module, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of the linear prediction filter;
  • a processing module configured to obtain a lifting factor, where the lifting factor is obtained according to a correlation between frequencies in a short-term spectral envelope corresponding to the M first linear prediction coefficients;
  • a synthesis module, configured to modify the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.
  • In the method of the embodiments of the present invention, the lifting factor contains the correlation among the frequencies of the speech, and the modification of the short-time spectral envelope of the speech, obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the mid-to-high frequency spectral components lost from the speech are compensated to some extent.
  • Given the determining effect of the formant energy on speech quality and the contribution of the mid-to-high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiments of the present invention.
  • the speech enhancement method according to the embodiments of the present invention has a simple calculation process and good robustness, can simultaneously improve the intelligibility and quality of speech, can recover high-frequency components lost due to coding distortion, and is particularly suitable for mitigating the degradation of communication voice quality caused by the convergence and interworking of different gateways.
  • FIG. 3 is a comparison of a voiced frame in the frequency domain after the cascading scheme and after the voice enhancement method of the embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech after processing by the cascading scheme, and FIG. 3(c) is the frequency distribution of the cascaded speech after processing by the speech enhancement method of the embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • Figure 6 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic hardware structural diagram of a device for implementing an embodiment of the present invention.
  • the technical solution of the present invention can be applied to various communication systems, such as: GSM, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), General Packet Radio Service (GPRS), Long Term Evolution (LTE), and so on.
  • GSM Global System for Mobile Communications
  • CDMA Code Division Multiple Access
  • WCDMA Wideband Code Division Multiple Access
  • GPRS General Packet Radio Service
  • LTE Long Term Evolution
  • FIG. 1 is a flow chart of a method 100 for enhancing voice transmission in accordance with an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:
  • In 110, let the acquired voiced frame be s(n); the transfer function of the speech transmission can then be expressed in terms of the M first linear prediction coefficients a_i, where M is the order of the linear prediction filter.
  • the lifting factor is obtained based on the correlation between the frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients.
  • the first linear prediction coefficient is calculated according to the following formula:
  • the short-term spectral envelope of the speech frame can be defined as:
  • Step 130 is described in detail below: the M first linear prediction coefficients are modified according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.
  • the first linear prediction coefficient of the input speech frame signal is normalized as follows:
  • the voiced frame signal can be linearly filtered using equation (15), y(n) = Σ_i β_i·y(n−i) + s(n), thereby obtaining a speech frame signal with improved intelligibility.
  • the method of the embodiment of the present invention may include a step of determining whether the speech frame is a voiced frame; only when the speech frame is a voiced frame is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, thereby saving processing resources and improving processing efficiency.
  • the speech frame signal may be pre-emphasized, for example, pre-emphasized according to equation (16):
  • FIG. 2 is an LPC spectrum of a voiced frame processed using the prior art cascade scheme and the voice enhancement method of the embodiment of the present invention.
  • As can be seen from FIG. 2, the LPC spectrum of the voiced frame processed by the speech enhancement method of the present invention is generally enhanced, and the enhancement is not limited to the formant energy.
  • FIG. 3 is a comparison of a voiced frame in the frequency domain after the cascading scheme and after the voice enhancement method of the embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech after processing by the cascading scheme, and FIG. 3(c) is the frequency distribution of the cascaded speech after processing by the speech enhancement method of the embodiment of the present invention. The comparison of FIGS. 3(b) and 3(c) shows that, after the speech enhancement method of the embodiment of the present invention, the mid-to-high frequency components of the original speech are significantly compensated.
  • FIG. 4 shows the DRT scores of the original speech, the cascade-processed speech, and the speech processed according to the method of the embodiment of the present invention. In FIG. 4, 0 denotes the original speech; I denotes speech after one cascade; II denotes speech after two cascades; III denotes speech after three cascades; eII denotes the twice-cascaded speech processed by the method of the embodiment of the present invention; and eIII denotes the three-times-cascaded speech processed by the method of the embodiment of the present invention. Comparing III and eIII shows that the DRT score can be increased by up to 6.26% after processing by the method of the embodiment of the present invention.
  • In the method of the embodiments of the present invention, the lifting factor contains the correlation among the frequencies of the speech, and the modification of the short-time spectral envelope of the speech, obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the mid-to-high frequency spectral components lost from the speech are compensated to some extent.
  • Given the determining effect of the formant energy on speech quality and the contribution of the mid-to-high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiments of the present invention.
  • In addition, the calculation process of the method is simple and robust. Because the correlation among the frequencies of the speech is used, the method overcomes the shortcomings of the prior art in handling distorted formant enhancement or formant information loss, and can well recover the high-frequency components lost due to the convergence of different networks.
  • FIG. 5 is a schematic structural diagram of a voice enhancement device 200 according to an embodiment of the present invention.
  • the speech enhancement device can be used to implement the methods of embodiments of the present invention.
  • the voice enhancement device 200 includes: an acquisition module 210, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of the linear prediction filter;
  • the processing module 220 is configured to obtain a lifting factor, where the lifting factor is obtained according to a correlation between frequencies in a short-term spectral envelope corresponding to the M first linear prediction coefficients;
  • the synthesizing module 230 is configured to modify the M first linear prediction coefficients according to the correlation between the lifting factor and the M first linear prediction coefficients, so that the M second linear prediction coefficients obtained after the modification correspond to The second short-term spectral envelope is enhanced compared to the first short-time spectral envelope corresponding to the M first linear prediction coefficients, and the mid-high frequency spectral components are compensated to some extent.
  • the acquisition module 210 is configured to calculate the first linear prediction coefficients from the autocorrelation function of the voiced frame using the Levinson-Durbin recursive algorithm.
  • the processing module is configured to calculate the lifting factor according to the above formulas (10) - (12).
  • the synthesizing module is configured to modify the first linear prediction coefficient by using the above formula (13) to obtain the second linear prediction coefficient.
  • the speech enhancement apparatus 200 further includes a filtering module 240 for linearly filtering the voiced frame signal according to the second linear prediction coefficient, according to an embodiment of the present invention.
  • according to an embodiment of the present invention, the voice enhancement device 200 further includes a pre-emphasis module 250, configured to pre-emphasize the voiced frame signal using the foregoing formula (16) before the acquisition module acquires the M first linear prediction coefficients of the voiced frame signal.
  • the acquisition module may be configured to determine whether a speech frame is a voiced frame; only when the speech frame is a voiced frame is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, to save processing resources and improve processing efficiency.
  • the speech enhancement device 200 can be implemented using various hardware devices, for example a digital signal processing (DSP) chip, where the acquisition module 210, the processing module 220, the synthesis module 230, and the filtering module 240 may each be implemented on separate hardware devices or may be integrated into one hardware device.
  • DSP digital signal processing
  • FIG. 7 is a schematic hardware architecture 700 of a speech enhancement device 200 for implementing an embodiment of the present invention.
  • the hardware structure 700 includes a DSP chip 710, a memory 720, and an interface unit 730.
  • the DSP chip 710 can be used to implement the processing functions of the voice enhancement device 200 of the embodiment of the present invention, including the processing functions of the acquisition module 210, the processing module 220, the synthesis module 230, and the filtering module 240.
  • the memory 720 can be used to store the voiced frame signals to be processed and intermediate variables of the processing and processed voiced frame signals and the like.
  • the interface unit 730 can be used for data transmission with a subordinate device.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium.
  • the technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

The embodiments of the present invention relate to a voice enhancement method and device. The voice enhancement method includes: acquiring M first linear prediction coefficients of a voiced sound frame signal, wherein M is the order of a linear prediction filter; acquiring a raising factor, wherein the raising factor is obtained according to the relevance among the frequencies in the short-time spectrum envelope corresponding to the M first linear prediction coefficients; modifying the M first linear prediction coefficients according to the relevance between the raising factor and the M first linear prediction coefficients so that the formant energy of a second short-time spectrum envelope corresponding to M second linear prediction coefficients obtained after modification is enhanced and the medium-high frequency spectrum components thereof are compensated to a certain extent as compared to the first short-time spectrum envelope corresponding to the M first linear prediction coefficients. Given the determining effect of the formant energy on the tone quality of the voice and the contribution to the sentence intelligibility of the voice by the medium-high frequency spectrum components of the voice, after the processing of the method in the embodiments of the present invention, the quality of and intelligibility of the voice are improved together.

Description

Voice enhancement method and device

Technical field

Embodiments of the present invention relate to the field of communications, and in particular to a voice enhancement method and device.

Background
With the development of wireless technology, convergence between networks is becoming increasingly common, and to achieve interworking between networks, conversion between different code streams is required. For example, to converge an IP telephony network and a mobile telephony network, take a mobile phone calling an IP phone as an example: IP telephony mostly uses the G.723 and G.729 speech coding protocols, while the mobile communication field mostly uses the Adaptive Multi-Rate (AMR) speech coding standard, so conversion between the two different code streams G.729 and AMR is required. At present there are two main schemes for conversion between code streams: the Tandem (cascading) scheme and the Transcoding scheme. When the Tandem scheme is used for code-stream conversion, the speech quality is impaired because two lossy compressions are involved, and the objective Mean Opinion Score (MOS) decreases, which affects the intelligibility of the speech. Compared with the former scheme, the Transcoding scheme can greatly reduce the amount of computation, but because of mismatches such as the rate difference between the two code streams, the speech quality is still impaired after the conversion and speech intelligibility declines, that is, the level of recognition of speech decreases.
In the prior art, the improvement of speech intelligibility may simultaneously amplify or introduce harsh noise and bring distortion or even severe distortion, and cannot recover the lost high-frequency components. In other words, the improvement of speech intelligibility in the prior art comes at the expense of speech quality; that is, it is difficult for current techniques to improve speech intelligibility and speech quality together.

Summary of the invention
One technical problem to be solved by the present invention is to overcome the shortcoming of the prior art that speech quality degrades while speech intelligibility is improved, and to provide a speech enhancement method with high-frequency compensation that exploits the contribution of the formants and the mid-to-high frequency components of speech to speech intelligibility.
According to an embodiment of the present invention, a speech enhancement method is provided, the method including:

acquiring M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;

obtaining a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and

modifying the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.
According to an embodiment of the present invention, a speech enhancement device is provided, the device including:

an acquisition module, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;

a processing module, configured to obtain a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and

a synthesis module, configured to modify the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.
In the method of the embodiments of the present invention, the lifting factor contains the correlation among the frequencies of the speech, and the modification of the short-time spectral envelope of the speech, obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the mid-to-high frequency spectral components lost from the speech are compensated to some extent. Given the determining effect of the formant energy on speech quality and the contribution of the mid-to-high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiments of the present invention.

The speech enhancement method according to the embodiments of the present invention has a simple calculation process and good robustness, can simultaneously improve the intelligibility and quality of speech, and can recover high-frequency components lost due to coding distortion; it is particularly suitable for mitigating the degradation of communication voice quality caused by the convergence and interworking of different gateways.

Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required for the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a method according to an embodiment of the present invention;

FIG. 2 shows the LPC spectra of a voiced frame processed by the prior-art cascading scheme and by the voice enhancement method of an embodiment of the present invention;

FIG. 3 is a comparison of a voiced frame in the frequency domain after the cascading scheme and after the voice enhancement method of an embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech after processing by the cascading scheme, and FIG. 3(c) is the frequency distribution of the cascaded speech after processing by the voice enhancement method of an embodiment of the present invention;

FIG. 4 shows the DRT scores of the original speech, the cascade-processed speech, and the speech processed according to the method of an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present invention; and

FIG. 7 is a schematic hardware structural diagram of a device for implementing an embodiment of the present invention.

Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The technical solution of the present invention can be applied to various communication systems, for example: GSM, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), General Packet Radio Service (GPRS), Long Term Evolution (LTE), and so on.
FIG. 1 is a flowchart of a voice enhancement method 100 according to an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:

110: acquiring M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;

120: obtaining a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients;

130: modifying the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.

In 110, let the acquired voiced frame be s(n). The transfer function of the speech transmission can then be expressed as

H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{i=1}^{M} a_i z^{-i}}    (1)

where M is the order of the linear prediction filter and the a_i are the first linear prediction coefficients.
The following describes in detail how, in 120, the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients a_i.

The first linear prediction coefficients a_i are calculated from

R_n(j) - \sum_{i=1}^{M} a_i R_n(j-i) = 0, \qquad 1 \le j \le M    (2)

where R_n(j) is the autocorrelation function of the voiced frame s(n) at lag j, that is,

R_n(j) = \sum_{n} s(n)\, s(n-j)    (3)

According to an embodiment of the present invention, the Levinson-Durbin recursive algorithm can be used to solve equation (2); the recursion proceeds as follows:

a. compute the autocorrelation function R_n(j) of s(n), j = 0, 1, ..., M;

b. let E^{(0)} = R_n(0);

c. the recursion starts from i = 1;

d. perform the recursive computation according to the following equations (4)-(6):

k_i = \frac{R_n(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R_n(i-j)}{E^{(i-1)}}    (4)

a_i^{(i)} = k_i, \qquad a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad j = 1, ..., i-1    (5)

E^{(i)} = \left(1 - k_i^2\right) E^{(i-1)}    (6)

e. let i = i + 1; if i > M the algorithm ends, otherwise return to step d and continue the recursion.

In equations (4)-(6), a_j^{(i)} denotes the j-th prediction coefficient of the i-th order linear prediction filter, and E^{(i)} is the prediction residual energy of the i-th order linear prediction filter. After the recursion, the solutions of the predictors of all orders i = 1, 2, ..., M are obtained, and the final solution is

a_j = a_j^{(M)}, \qquad j = 1, 2, ..., M    (7)

Setting z = e^{j\omega}, the frequency characteristic of the voiced-frame signal generation model is obtained; that is, the frequency response of the linear system of the speech production model can be described as

H(e^{j\omega}) = \frac{G}{A(e^{j\omega})} = \frac{G}{1 - \sum_{i=1}^{M} a_i e^{-j\omega i}}    (8)
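For illustration only, the autocorrelation of equation (3) and the Levinson-Durbin recursion of equations (4)-(7) can be sketched numerically as follows (a minimal NumPy sketch; the function names are illustrative and not part of the patent):

```python
import numpy as np

def autocorrelation(s, M):
    """R_n(j) = sum_n s(n) * s(n - j) for j = 0, ..., M, as in equation (3)."""
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[j:], s[:len(s) - j]) for j in range(M + 1)])

def levinson_durbin(R, M):
    """Solve the normal equations (2) for the M first linear prediction
    coefficients a_1, ..., a_M by the Levinson-Durbin recursion, eqs. (4)-(7)."""
    a = np.zeros(M + 1)          # a[j] holds the current-order coefficient a_j^(i)
    E = R[0]                     # E^(0) = R_n(0)
    for i in range(1, M + 1):
        # equation (4): reflection coefficient k_i
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k                 # equation (5): a_i^(i) = k_i
        for j in range(1, i):    # equation (5): a_j^(i) = a_j^(i-1) - k_i * a_{i-j}^(i-1)
            a[j] = a_prev[j] - k * a_prev[i - j]
        E *= (1.0 - k * k)       # equation (6): E^(i) = (1 - k_i^2) * E^(i-1)
    return a[1:], E              # equation (7): a_j = a_j^(M), plus the final residual energy
```

Calling levinson_durbin(autocorrelation(s, M), M) on a voiced frame s returns the M first linear prediction coefficients a_1, ..., a_M used in the remainder of the method.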
According to the definition of the power spectrum, the short-time spectral envelope of the speech frame can be defined as

\left| H(e^{j\omega}) \right| = \left| \frac{G}{A(e^{j\omega})} \right|    (9)
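As a quick numerical check of equations (8)-(9), the short-time spectral envelope can be evaluated directly from the coefficients (again only a sketch; the function name, the number of frequency points, and taking G = 1 are illustrative assumptions):

```python
import numpy as np

def lpc_spectral_envelope(a, n_freq=256, G=1.0):
    """|H(e^{jw})| = |G / A(e^{jw})| with A(e^{jw}) = 1 - sum_i a_i e^{-jwi},
    evaluated at n_freq frequencies in [0, pi), cf. equations (8) and (9)."""
    a = np.asarray(a, dtype=float)
    w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    orders = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, orders)) @ a   # A(e^{jw}) at every frequency
    return np.abs(G / A)
```

Plotting this envelope before and after the modification of the coefficients gives curves of the kind compared in FIG. 2.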
Step 130 is described in detail below, that is, modifying the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.

First, the first linear prediction coefficients a_i of the input speech frame signal are normalized as follows:

x_i = |a_i| - \operatorname{int}\!\left( |a_i| / 2\pi \right) \times 2\pi, \qquad i = 1, 2, ..., M    (10)

The normalized coefficients are then processed with a sinusoidal model:

for a_i \ge 0:

flag_i = \begin{cases} 1, & x_i < \pi \\ -1, & x_i > \pi \\ 0, & x_i = \pi \end{cases}    (11-1)

for a_i < 0:

flag_i = \begin{cases} -1, & x_i < \pi \\ 1, & x_i > \pi \\ 0, & x_i = \pi \end{cases}    (11-2)

The lifting factor f is then given by

f = \frac{\sum_{i=1}^{M} \left( flag_i\, a_i - \mu \right)}{M}    (12)

where \mu is the mean of the first linear prediction coefficients a_i and M is the order of the linear prediction filter.

It should be noted that obtaining the lifting factor from the normalized first linear prediction coefficients and the sinusoidal model of the voiced frame is merely an example; a person skilled in the art may choose other methods to obtain the lifting factor according to the specific situation.
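A compact sketch of the computation in equations (10)-(12) could look as follows (NumPy; the function name is illustrative, and the exact combination of flag_i, a_i and μ in equation (12) follows the reconstruction given above, so it should be treated as an assumption):

```python
import numpy as np

def lifting_factor(a):
    """Lifting factor f from the M first linear prediction coefficients,
    following equations (10)-(12) as given above."""
    a = np.asarray(a, dtype=float)
    M = len(a)
    # equation (10): fold |a_i| into [0, 2*pi)
    x = np.abs(a) - np.floor(np.abs(a) / (2 * np.pi)) * 2 * np.pi
    # equation (11-1): sinusoidal-model value for a_i >= 0
    flag = np.where(x < np.pi, 1.0, np.where(x > np.pi, -1.0, 0.0))
    # equation (11-2): the sign of the value flips for a_i < 0
    flag = np.where(a < 0.0, -flag, flag)
    mu = a.mean()                             # mean of the first linear prediction coefficients
    return float(np.sum(flag * a - mu) / M)   # equation (12)
```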
The above linear prediction coefficients a_i are then modified using equation (13), in which the lifting factor f is applied to the coefficients a_j^{(i)} of each order i of the linear prediction filter for j = 1, ..., i-1, to obtain the second linear prediction coefficients \beta_i.

Replacing the first linear prediction coefficients a_i in equation (9) with the second linear prediction coefficients \beta_i obtained after the modification, the transfer function can be written as

H'(e^{j\omega}) = \frac{G}{1 - \sum_{i=1}^{M} \beta_i e^{-j\omega i}}    (14)

Let y(n) denote the speech frame output after enhancement by the speech enhancement method of the embodiment of the present invention; then

y(n) = \left( \sum_{i=1}^{M} \beta_i\, y(n-i) \right) + s(n)    (15)

According to an embodiment of the present invention, the voiced frame signal s(n) can be linearly filtered using equation (15), thereby obtaining a speech frame signal with improved intelligibility.

It should be noted that modifying the first linear prediction coefficients according to equation (13), on the basis of the lifting factor and the correlation of the first linear prediction coefficients, is merely an example; a person skilled in the art may choose an appropriate method to modify the first linear prediction coefficients as needed, as long as the technical effect that the formant energy is enhanced and the mid-to-high frequency spectral components are compensated to some extent is achieved.
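The linear filtering of equation (15) itself is a plain all-pole recursion and can be sketched as follows (illustrative names; the modified coefficients beta are assumed to have been obtained from equation (13) beforehand):

```python
import numpy as np

def enhance_filter(s, beta):
    """Equation (15): y(n) = sum_{i=1..M} beta_i * y(n - i) + s(n), i.e. the
    voiced frame is passed through the modified all-pole synthesis filter."""
    s = np.asarray(s, dtype=float)
    beta = np.asarray(beta, dtype=float)
    M = len(beta)
    y = np.zeros(len(s))
    for n in range(len(s)):
        acc = s[n]
        for i in range(1, min(M, n) + 1):
            acc += beta[i - 1] * y[n - i]
        y[n] = acc
    return y
```

Equivalently, the same filtering can be performed with scipy.signal.lfilter([1.0], np.concatenate(([1.0], -beta)), s).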
According to an embodiment of the present invention, considering that the formants of a speech frame appear only in voiced frames, before step 110 the method of the embodiment of the present invention may include a step of determining whether the speech frame is a voiced frame; only when the speech frame is a voiced frame is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, thereby saving processing resources and improving processing efficiency.

According to an embodiment of the present invention, before step 110 the speech frame signal may be pre-emphasized, for example according to equation (16):

H(z) = 1 - 0.95 z^{-1}    (16)

In this case, after the intelligibility of the input speech frame has been improved, the inverse processing is also performed to remove the effect of the pre-emphasis.
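A sketch of the pre-emphasis of equation (16) and of the inverse processing mentioned above (illustrative names; the inverse filter simply undoes 1 - 0.95 z^{-1}) is:

```python
import numpy as np

def pre_emphasis(s, alpha=0.95):
    """Equation (16): H(z) = 1 - 0.95 z^{-1}, i.e. x(n) = s(n) - 0.95 * s(n - 1)."""
    s = np.asarray(s, dtype=float)
    x = s.copy()
    x[1:] -= alpha * s[:-1]
    return x

def de_emphasis(x, alpha=0.95):
    """Inverse of equation (16): y(n) = x(n) + 0.95 * y(n - 1), applied after
    the enhancement to remove the effect of the pre-emphasis."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + (alpha * y[n - 1] if n > 0 else 0.0)
    return y
```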
In a specific application of the method according to the embodiments of the present invention, the effect of the speech enhancement method of the embodiments of the present invention can be seen from FIG. 2 to FIG. 4.

FIG. 2 shows the LPC spectra of a voiced frame processed by the prior-art cascading scheme and by the voice enhancement method of an embodiment of the present invention. As can be seen from FIG. 2, the LPC spectrum of the voiced frame processed by the speech enhancement method of the present invention is generally enhanced, and the enhancement is not limited to the formant energy.

FIG. 3 is a comparison of a voiced frame in the frequency domain after the cascading scheme and after the voice enhancement method of an embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech after processing by the cascading scheme, and FIG. 3(c) is the frequency distribution of the cascaded speech after processing by the voice enhancement method of an embodiment of the present invention. The comparison of FIG. 3(b) and FIG. 3(c) shows that, after the speech enhancement method of the embodiment of the present invention, the mid-to-high frequency components of the original speech are clearly compensated.

FIG. 4 shows the DRT scores of the original speech, the cascade-processed speech, and the speech processed according to the method of an embodiment of the present invention. In FIG. 4, 0 denotes the original speech; I denotes speech after one cascade; II denotes speech after two cascades; III denotes speech after three cascades; eII denotes the twice-cascaded speech processed by the method of the embodiment of the present invention; and eIII denotes the three-times-cascaded speech processed by the method of the embodiment of the present invention. Comparing III and eIII shows that the DRT score can be increased by up to 6.26% after processing by the method of the embodiment of the present invention.

In the method of the embodiments of the present invention, the lifting factor contains the correlation among the frequencies of the speech, and the modification of the short-time spectral envelope of the speech, obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the mid-to-high frequency spectral components lost from the speech are compensated to some extent. Given the determining effect of the formant energy on speech quality and the contribution of the mid-to-high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiments of the present invention.

In addition, the calculation process of the method according to the embodiments of the present invention is simple and robust. Because the correlation among the frequencies of the speech is used, the method overcomes the shortcomings of the prior art in handling distorted formant enhancement or formant information loss, and can well recover the high-frequency components lost due to the convergence of different networks.
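Putting the sketches above together, one possible per-frame flow (reusing the helper functions defined in the earlier sketches; modify_coeffs is a caller-supplied stand-in for equation (13), and the order M = 10 is only an illustrative default) is:

```python
import numpy as np

def enhance_frame(s, M=10, is_voiced=True, modify_coeffs=None):
    """End-to-end sketch of the enhancement flow: unvoiced frames are passed
    through unchanged, voiced frames are pre-emphasized, analyzed, modified
    with the lifting factor, filtered, and de-emphasized."""
    if not is_voiced or modify_coeffs is None:
        return np.asarray(s, dtype=float)     # unvoiced frames are output directly
    x = pre_emphasis(s)                       # equation (16)
    R = autocorrelation(x, M)                 # equation (3)
    a, _ = levinson_durbin(R, M)              # equations (2), (4)-(7)
    f = lifting_factor(a)                     # equations (10)-(12)
    beta = modify_coeffs(a, f)                # equation (13): second linear prediction coefficients
    y = enhance_filter(x, beta)               # equation (15)
    return de_emphasis(y)                     # remove the effect of the pre-emphasis
```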
FIG. 5 is a schematic structural diagram of a voice enhancement device 200 according to an embodiment of the present invention. The voice enhancement device can be used to implement the methods of the embodiments of the present invention. As shown in FIG. 5, the voice enhancement device 200 includes:

an acquisition module 210, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of the linear prediction filter;

a processing module 220, configured to obtain a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and

a synthesis module 230, configured to modify the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and the mid-to-high frequency spectral components are compensated to some extent.

According to an embodiment of the present invention, the acquisition module 210 is configured to calculate the first linear prediction coefficients from the autocorrelation function of the voiced frame using the Levinson-Durbin recursive algorithm.

According to an embodiment of the present invention, the processing module is configured to calculate the lifting factor according to equations (10)-(12) above.

According to an embodiment of the present invention, the synthesis module is configured to modify the first linear prediction coefficients using equation (13) above to obtain the second linear prediction coefficients.

As shown in FIG. 6, according to an embodiment of the present invention, the voice enhancement device 200 further includes a filtering module 240, configured to linearly filter the voiced frame signal according to the second linear prediction coefficients.

As shown in FIG. 6, according to an embodiment of the present invention, the voice enhancement device 200 further includes a pre-emphasis module 250, configured to pre-emphasize the voiced frame signal using equation (16) above before the acquisition module acquires the M first linear prediction coefficients of the voiced frame signal.

According to an embodiment of the present invention, the acquisition module may be configured to determine whether a speech frame is a voiced frame; only when the speech frame is a voiced frame is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, to save processing resources and improve processing efficiency.

Those skilled in the art should understand that the voice enhancement device 200 according to the embodiments of the present invention can be implemented using various hardware devices, for example a digital signal processing (DSP) chip, where the acquisition module 210, the processing module 220, the synthesis module 230, and the filtering module 240 may each be implemented on separate hardware devices or may be integrated into one hardware device.

FIG. 7 shows a schematic hardware structure 700 for implementing the voice enhancement device 200 of an embodiment of the present invention. As shown in FIG. 7, the hardware structure 700 includes a DSP chip 710, a memory 720, and an interface unit 730. The DSP chip 710 can be used to implement the processing functions of the voice enhancement device 200 of the embodiment of the present invention, including all the processing functions of the acquisition module 210, the processing module 220, the synthesis module 230, and the filtering module 240. The memory 720 can be used to store the voiced frame signals to be processed, intermediate variables of the processing, the processed voiced frame signals, and the like. The interface unit 730 can be used for data transmission with subordinate devices.
A person of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.

It will be clear to a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of units is only a logical function division, and in actual implementation there may be other division manners: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of the embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The foregoing is only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily think of variations or replacements within the technical scope disclosed by the present invention, and these shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A speech enhancement method, comprising:

acquiring M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;

obtaining a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and

modifying the M first linear prediction coefficients according to the lifting factor and the correlation among the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formants of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification are enhanced and the mid-to-high frequency spectral components are compensated to some extent.

2. The method according to claim 1, wherein the acquiring M first linear prediction coefficients of a voiced frame signal comprises:

calculating the first linear prediction coefficients from the autocorrelation function of the voiced frame using the Levinson-Durbin recursive algorithm.
3. The method according to claim 1, wherein the obtaining a lifting factor comprises calculating the lifting factor according to the following formulas:

x_i = |a_i| - \operatorname{int}\!\left( |a_i| / 2\pi \right) \times 2\pi, \qquad i = 1, 2, ..., M

for a_i \ge 0:

flag_i = \begin{cases} 1, & x_i < \pi \\ -1, & x_i > \pi \\ 0, & x_i = \pi \end{cases}

for a_i < 0:

flag_i = \begin{cases} -1, & x_i < \pi \\ 1, & x_i > \pi \\ 0, & x_i = \pi \end{cases}

f = \frac{\sum_{i=1}^{M} \left( flag_i\, a_i - \mu \right)}{M}

where a_i are the first linear prediction coefficients, x_i are the normalized first linear prediction coefficients, flag_i is the value of the sinusoidal model, \mu is the mean of the a_i, M is the order of the linear prediction, and f is the lifting factor.
4. 如权利要求 1至 3任一项所述的方法, 其特征在于, 所述根据所述提升因子以及所述 M个第一线性预测系数之间的相关性 修改所述 M个第一线性预测系数, 包括: 4. A method according to any one of claims 1 to 3, characterized in that Modifying the M first linear prediction coefficients according to the correlation between the lifting factor and the M first linear prediction coefficients, including:
modifying the first linear prediction coefficients by using the following formula to obtain the second linear prediction coefficients:
(the formula itself is published only as an image; the legible portion reads: j = 1, …, i - 1)

where i denotes the i-th order coefficient in the M-order linear prediction filter; a_{i,j} is a first linear prediction coefficient, denoting the j-th linear prediction coefficient of the i-th order linear prediction filter; f is the lifting factor; and b_{i,j} is a second linear prediction coefficient, denoting the j-th linear prediction coefficient of the i-th order linear prediction filter.
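The modification formula of claims 4 and 10 is likewise published only as an image, so it cannot be reproduced here. As a loose illustration of the general idea, namely rescaling linear prediction coefficients with an order-dependent weight derived from the lifting factor, the sketch below applies a standard bandwidth-expansion-style weighting. This is a stand-in technique, not the claimed formula, and the mapping gamma = 1 + f is an assumption.

```python
import numpy as np

def modify_lpc(a_first, f):
    """Stand-in illustration only: scale each linear prediction coefficient by
    an order-dependent power of a weight derived from the lifting factor f.
    This is classic bandwidth-expansion weighting, NOT the formula of
    claims 4 and 10, which is published only as an image."""
    a = np.asarray(a_first, dtype=float)
    gamma = 1.0 + f                      # assumed mapping from lifting factor to weight
    j = np.arange(1, len(a) + 1)         # coefficient index j = 1..M
    return a * gamma ** j                # illustrative "second" linear prediction coefficients
```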
5. The method according to any one of claims 1 to 4, wherein the method further comprises:
performing linear filtering on the voiced frame according to the second linear prediction coefficients.
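Claim 5 does not spell out the filter structure. One plausible reading, sketched below, removes the original spectral envelope with the analysis filter A(z) built from the first coefficients and re-imposes the enhanced envelope with the synthesis filter 1/A'(z) built from the second coefficients; the use of scipy.signal.lfilter and this analysis/synthesis arrangement are assumptions, not details taken from the patent.

```python
import numpy as np
from scipy.signal import lfilter

def filter_voiced_frame(frame, a_first, a_second):
    """Assumed realization of the linear filtering of claims 5 and 11:
    LP analysis with the first coefficients, LP synthesis with the second."""
    frame = np.asarray(frame, dtype=float)
    A_first = np.concatenate(([1.0], -np.asarray(a_first)))    # A(z)  = 1 - sum_j a_j z^-j
    A_second = np.concatenate(([1.0], -np.asarray(a_second)))  # A'(z) = 1 - sum_j b_j z^-j
    residual = lfilter(A_first, [1.0], frame)                  # whitening / analysis
    return lfilter([1.0], A_second, residual)                  # synthesis with enhanced envelope
```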
6. The method according to any one of claims 1 to 5, wherein
before the obtaining M first linear prediction coefficients of a voiced frame signal, the method further comprises:
pre-emphasizing the voiced frame signal by using the following formula:
H(z) = 1 - 0.95z⁻¹
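A small worked example of the pre-emphasis filter H(z) = 1 - 0.95z⁻¹ of claims 6 and 12: the time-domain form is y[n] = x[n] - 0.95·x[n-1], and the array-based implementation below is just one straightforward way to apply it (variable names are illustrative).

```python
import numpy as np

def pre_emphasize(signal, alpha=0.95):
    """Apply H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n - 1]."""
    x = np.asarray(signal, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # no previous sample for n = 0
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```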
7. A voice enhancement device, wherein the device comprises:
an obtaining module, configured to obtain M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;
a processing module, configured to obtain a lifting factor, where the lifting factor is obtained according to a correlation between frequencies in a short-time spectral envelope corresponding to the M first linear prediction coefficients; and
a synthesis module, configured to modify the M first linear prediction coefficients according to the lifting factor and a correlation among the M first linear prediction coefficients, so that, compared with a first short-time spectral envelope corresponding to the M first linear prediction coefficients, a second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification has enhanced formant energy and middle-to-high frequency spectral components that are compensated to some extent.
8. The device according to claim 7, wherein
the obtaining module is configured to calculate the first linear prediction coefficients by using the Levinson-Durbin recursive algorithm according to an autocorrelation function of the voiced frame.
9. The device according to claim 7, wherein
the processing module is configured to calculate the lifting factor according to the following formulas:

x_i = |a_i| - ((int)(|a_i| / 2π)) × 2π,  i = 1, 2, …, M,  when a_i ≥ 0

(the corresponding expression for a_i < 0 and the sinusoidal-model expression for s_i appear only as figures, including Figure imgf000014_0001, in the published text)

flag_i = -1 when x_i < π;  flag_i = 1 when x_i > π;  flag_i = 0 when x_i = π

f = ( Σ_{i=1}^{M} (flag_i - μ) ) / M

where a_i is a first linear prediction coefficient, x_i is the normalized first linear prediction coefficient, s_i is the sinusoidal-model value, μ is the corresponding mean value, M is the order of linear prediction, and f is the lifting factor.
10. The device according to any one of claims 7 to 9, wherein
the synthesis module modifies the first linear prediction coefficients by using the following formula to obtain the second linear prediction coefficients:
(formula (5); only "j = 1, …, i - 1" is legible in the published text)

where i denotes the i-th order coefficient in the M-order linear prediction filter; a_{i,j} is a first linear prediction coefficient, denoting the j-th linear prediction coefficient of the i-th order linear prediction filter; f is the lifting factor; and b_{i,j} is a second linear prediction coefficient, denoting the j-th linear prediction coefficient of the i-th order linear prediction filter.
11. The device according to any one of claims 7 to 10, wherein the device further comprises:
a filtering module, configured to perform linear filtering on the voiced frame signal according to the second linear prediction coefficients.
12. The device according to any one of claims 7 to 10, wherein the device further comprises:
a pre-emphasis module, configured to pre-emphasize the voiced frame signal by using the following formula before the obtaining module obtains the M first linear prediction coefficients of the voiced frame signal:
H(z) = 1 - 0.95z⁻¹
PCT/CN2011/078087 2011-08-05 2011-08-05 Voice enhancement method and device WO2012159370A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201180001446.0A CN103038825B (en) 2011-08-05 2011-08-05 Voice enhancement method and device
PCT/CN2011/078087 WO2012159370A1 (en) 2011-08-05 2011-08-05 Voice enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/078087 WO2012159370A1 (en) 2011-08-05 2011-08-05 Voice enhancement method and device

Publications (1)

Publication Number Publication Date
WO2012159370A1 true WO2012159370A1 (en) 2012-11-29

Family

ID=47216591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/078087 WO2012159370A1 (en) 2011-08-05 2011-08-05 Voice enhancement method and device

Country Status (2)

Country Link
CN (1) CN103038825B (en)
WO (1) WO2012159370A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI555010B (en) * 2013-12-16 2016-10-21 三星電子股份有限公司 Audio encoding method and apparatus, audio decoding method, and non-transitory computer-readable recording medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3062945B1 (en) * 2017-02-13 2019-04-05 Centre National De La Recherche Scientifique Method and apparatus for dynamically modifying the voice timbre by frequency shifting of the formants of a spectral envelope
CN106856623B (en) * 2017-02-20 2020-02-11 鲁睿 Baseband voice signal communication noise suppression method and system
CN110797039B (en) * 2019-08-15 2023-10-24 腾讯科技(深圳)有限公司 Voice processing method, device, terminal and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1619646A (en) * 2003-11-21 2005-05-25 三星电子株式会社 Method of and apparatus for enhancing dialog using formants
US20100063808A1 (en) * 2008-09-06 2010-03-11 Yang Gao Spectral Envelope Coding of Energy Attack Signal
CN102044250A (en) * 2009-10-23 2011-05-04 华为技术有限公司 Band spreading method and apparatus


Also Published As

Publication number Publication date
CN103038825A (en) 2013-04-10
CN103038825B (en) 2014-04-30

Similar Documents

Publication Publication Date Title
Li et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement
JP6374028B2 (en) Voice profile management and speech signal generation
US9218820B2 (en) Audio fingerprint differences for end-to-end quality of experience measurement
US11605394B2 (en) Speech signal cascade processing method, terminal, and computer-readable storage medium
CN1215459C (en) Bandwidth extension of acoustic signals
JP2021015301A (en) Device and method for reducing quantization noise in a time-domain decoder
WO2021052287A1 (en) Frequency band extension method, apparatus, electronic device and computer-readable storage medium
WO2021147237A1 (en) Voice signal processing method and apparatus, and electronic device and storage medium
WO2010066158A1 (en) Methods and apparatuses for encoding signal and decoding signal and system for encoding and decoding
WO2013060223A1 (en) Frame loss compensation method and apparatus for voice frame signal
US20100106269A1 (en) Method and apparatus for signal processing using transform-domain log-companding
CN1969319A (en) Signal encoding
WO2021052285A1 (en) Frequency band expansion method and apparatus, electronic device, and computer readable storage medium
KR101924767B1 (en) Voice frequency code stream decoding method and device
JP4679513B2 (en) Hierarchical coding apparatus and hierarchical coding method
WO2008110870A2 (en) Speech coding system and method
JP2008519990A (en) Signal coding method
JP2010078915A (en) Audio decoding method, apparatus, and program
WO2012159370A1 (en) Voice enhancement method and device
WO2023197809A1 (en) High-frequency audio signal encoding and decoding method and related apparatuses
WO2013066244A1 (en) Bandwidth extension of audio signals
US20110137644A1 (en) Decoding speech signals
JP6573887B2 (en) Audio signal encoding method, decoding method and apparatus
CN113035207A (en) Audio processing method and device
WO2010103854A2 (en) Speech encoding device, speech decoding device, speech encoding method, and speech decoding method

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201180001446.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11866050

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11866050

Country of ref document: EP

Kind code of ref document: A1