WO2012159370A1 - Speech enhancement method and device - Google Patents

Speech enhancement method and device

Info

Publication number
WO2012159370A1
WO2012159370A1 · PCT/CN2011/078087 · CN2011078087W
Authority
WO
WIPO (PCT)
Prior art keywords
linear prediction
coefficient
prediction coefficients
lifting factor
prediction coefficient
Prior art date
Application number
PCT/CN2011/078087
Other languages
English (en)
Chinese (zh)
Inventor
田薇
李玉龙
邝秀玉
贺知明
Original Assignee
华为技术有限公司
电子科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司, 电子科技大学 filed Critical 华为技术有限公司
Priority to PCT/CN2011/078087 priority Critical patent/WO2012159370A1/fr
Priority to CN201180001446.0A priority patent/CN103038825B/zh
Publication of WO2012159370A1 publication Critical patent/WO2012159370A1/fr

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • Embodiments of the present invention relate to the field of communications, and in particular, to a speech enhancement method and apparatus. Background Art
  • When the Tandem scheme is used for code-stream conversion, the speech quality is impaired because two lossy compression stages are involved; the objective Mean Opinion Score (MOS) decreases, which affects the intelligibility of the speech.
  • The Transcoding scheme can greatly reduce the amount of computation. However, because of the mismatch between the rates of the two code streams, the voice quality is still impaired after the stream conversion, and the intelligibility of the speech declines, that is, the degree to which the speech can be recognized decreases.
  • One technical problem to be solved by the present invention is to overcome the shortcoming of the prior art that speech quality is degraded while speech intelligibility is improved, and to provide a speech enhancement method that uses the strong contribution of the formants and of the medium and high frequency components of speech to intelligibility to compensate the medium and high frequencies.
  • A speech enhancement method comprising: acquiring M first linear prediction coefficients of a voiced frame signal, where M is the order of a linear prediction filter;
  • obtaining a lifting factor, where the lifting factor is obtained according to the correlation between frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and modifying the M first linear prediction coefficients according to the lifting factor and the correlation between the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and its medium and high frequency spectral components are compensated to some extent.
  • A speech enhancement device includes: an acquisition module, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of the linear prediction filter;
  • a processing module configured to obtain a lifting factor, where the lifting factor is obtained according to a correlation between frequencies in a short-term spectral envelope corresponding to the M first linear prediction coefficients;
  • a synthesizing module, configured to modify the M first linear prediction coefficients according to the lifting factor and the correlation between the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and its medium and high frequency spectral components are compensated to some extent.
  • Because the lifting factor incorporates the correlation between the frequencies of the speech, the modification of the short-time spectral envelope, which is obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the lost medium and high frequency spectral components of the speech are compensated to some extent.
  • In view of the decisive effect of the formant energy on speech quality and the contribution of the medium and high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiment of the present invention.
  • The speech enhancement method according to the embodiment of the present invention has a simple calculation process and good robustness, can simultaneously improve the intelligibility and quality of speech, can recover high-frequency components lost due to coding distortion, and is particularly suitable for mitigating the deterioration in communication voice quality caused by the convergence and interworking of different gateways.
  • FIG. 3 is a frequency-domain comparison of voiced frames after the cascading scheme and after the speech enhancement method of the embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech processed by the cascading scheme, and FIG. 3(c) is the frequency distribution after the cascaded speech is processed by the speech enhancement method of the embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • Figure 6 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the hardware structure of a device for implementing an embodiment of the present invention. Detailed Description
  • The technical solution of the present invention can be applied to various communication systems, such as the Global System for Mobile Communications (GSM), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), General Packet Radio Service (GPRS), Long Term Evolution (LTE), and the like.
  • FIG. 1 is a flow chart of a speech enhancement method 100 according to an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:
  • The acquired voiced frame can be modeled by the transfer function of a linear prediction speech production model, where M is the order of the linear prediction filter and the model coefficients are the first linear prediction coefficients.
  • the boosting factor is obtained based on the correlation between the frequencies in the short-term spectral envelope corresponding to the M first linear predictive coefficients.
  • the first linear prediction coefficient is calculated according to the following formula:
  • the short-term spectral envelope of the speech frame can be defined as:
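  • The defining formula does not appear in the text above; as a point of reference only, in standard LPC analysis the short-time spectral envelope of an M-th order model is commonly written as below, where a_i are the (first) linear prediction coefficients and G is a gain term. This is the conventional definition and not necessarily the patent's exact notation.
```latex
% Standard LPC short-time spectral envelope (illustrative, not the patent's equation)
P(\omega) = \frac{G^{2}}{\left| 1 - \sum_{i=1}^{M} a_i \, e^{-j\omega i} \right|^{2}}
```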
  • Step 130 is described in detail below: the first linear prediction coefficients are modified according to the lifting factor and the correlation between the first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the second linear prediction coefficients obtained after the modification is enhanced and its medium and high frequency spectral components are compensated to some extent.
  • the first linear prediction coefficient of the input speech frame signal is normalized as follows:
  • The voiced frame signal can then be linearly filtered by using equation (15), thereby obtaining a speech frame signal with improved intelligibility.
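  • Equation (15) itself does not appear in the text above, so the following Python sketch only illustrates one common way to apply modified LPC coefficients to a frame: inverse-filter with the original coefficients to obtain the residual, then re-synthesize with the modified coefficients. The function name, the use of scipy.signal.lfilter, and the predictor sign convention are assumptions for illustration, not the patent's exact filtering formula.
```python
import numpy as np
from scipy.signal import lfilter

def apply_modified_lpc(frame, a_orig, a_mod):
    """Illustrative filtering step: inverse-filter the voiced frame with the
    original LPC coefficients to get the residual, then re-synthesize with
    the modified coefficients.

    a_orig, a_mod: arrays [a_1, ..., a_M] for the predictor
    x_hat(n) = sum_i a_i * x(n - i).
    """
    A_orig = np.concatenate(([1.0], -np.asarray(a_orig, dtype=float)))  # analysis filter A(z)
    A_mod = np.concatenate(([1.0], -np.asarray(a_mod, dtype=float)))    # modified A(z)
    residual = lfilter(A_orig, [1.0], frame)   # e(n) = A(z) x(n)
    return lfilter([1.0], A_mod, residual)     # y(n) synthesized with modified coefficients
```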
  • The method of the embodiment of the present invention may include determining whether a speech frame is a voiced frame; only when the speech frame is voiced is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, thereby saving processing resources and improving processing efficiency.
  • The speech frame signal may be pre-emphasized, for example according to equation (16):
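  • Equation (16) is not shown above; the sketch below is a standard first-order pre-emphasis filter given purely for illustration, and the coefficient 0.95 is a typical value assumed here, not a value taken from the patent.
```python
import numpy as np

def pre_emphasize(frame, alpha=0.95):
    """Standard first-order pre-emphasis y(n) = x(n) - alpha * x(n-1)."""
    frame = np.asarray(frame, dtype=float)
    out = frame.copy()
    out[1:] -= alpha * frame[:-1]
    return out
```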
  • FIG. 2 is an LPC spectrum of a voiced frame processed using the prior art cascade scheme and the voice enhancement method of the embodiment of the present invention.
  • The LPC spectrum of the voiced frames processed by the speech enhancement method of the present invention is enhanced overall: not only is the formant energy enhanced, but the medium and high frequency spectral components are also compensated.
  • FIG. 3 is a frequency-domain comparison of voiced frames after the cascading scheme and after the speech enhancement method of the embodiment of the present invention, where FIG. 3(a) is the original speech, FIG. 3(b) is the frequency distribution of the original speech processed by the cascading scheme, and FIG. 3(c) is the frequency distribution after the cascaded speech is processed by the speech enhancement method of the embodiment of the present invention. Comparing FIG. 3(b) and FIG. 3(c) shows that, after the speech enhancement method of the embodiment of the present invention, the medium and high frequency components of the original speech are significantly compensated.
  • FIG. 4 shows the DRT scores of the original speech, the concatenation-processed speech, and the speech processed according to the method of the embodiment of the present invention, where:
  • 0 denotes the original speech;
  • I denotes the speech after one concatenation processing;
  • II denotes the speech frame after two concatenation processings;
  • III denotes the speech frame after three concatenation processings;
  • eII denotes the twice-concatenated speech frame processed by the method of the embodiment of the present invention;
  • eIII denotes the three-times-concatenated speech frame processed by the method of the embodiment of the present invention. Comparing III with eIII shows that the DRT score can be increased by up to 6.26% after processing by the method of the embodiment of the present invention.
  • Because the lifting factor incorporates the correlation between the frequencies of the speech, the modification of the short-time spectral envelope, which is obtained by modifying the M first linear prediction coefficients, also incorporates this correlation, so that the formant energy of the modified short-time spectral envelope is enhanced and the lost medium and high frequency spectral components of the speech are compensated to some extent.
  • In view of the decisive effect of the formant energy on speech quality and the contribution of the medium and high frequency spectral components of speech to speech intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiment of the present invention.
  • The calculation process is simple and robust. Since the correlation between the frequencies of the speech is utilized, the problems of formant distortion or formant information loss in prior-art enhancement processing are avoided, and the high-frequency components lost due to the fusion of different networks can be well recovered.
  • FIG. 5 is a schematic structural diagram of a voice enhancement device 200 according to an embodiment of the present invention.
  • the speech enhancement device can be used to implement the methods of embodiments of the present invention.
  • The voice enhancement device 200 includes: an acquisition module 210, configured to acquire M first linear prediction coefficients of a voiced frame signal, where M is the order of the linear prediction filter;
  • the processing module 220 is configured to obtain a lifting factor, where the lifting factor is obtained according to a correlation between frequencies in a short-term spectral envelope corresponding to the M first linear prediction coefficients;
  • a synthesizing module 230, configured to modify the M first linear prediction coefficients according to the lifting factor and the correlation between the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and its medium and high frequency spectral components are compensated to some extent.
  • The acquisition module 210 is configured to calculate the first linear prediction coefficients by using the Levinson-Durbin recursive algorithm according to the autocorrelation function of the voiced frame.
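  • A minimal Python sketch of that computation, deriving the M first linear prediction coefficients from the frame autocorrelation via the Levinson-Durbin recursion; the helper names and the sign convention (predictor x̂(n) = Σ aᵢ·x(n−i)) are illustrative assumptions rather than the patent's notation.
```python
import numpy as np

def autocorrelation(frame, M):
    """Autocorrelation lags R(0..M) of a (windowed) voiced frame."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    return np.array([np.dot(frame[:N - k], frame[k:]) for k in range(M + 1)])

def levinson_durbin(R, M):
    """Levinson-Durbin recursion: returns predictor coefficients a_1..a_M
    such that x_hat(n) = sum_i a_i * x(n - i)."""
    a = np.zeros(M + 1)
    E = R[0]                      # prediction error energy
    for m in range(1, M + 1):
        k = (R[m] - np.dot(a[1:m], R[m - 1:0:-1])) / E   # reflection coefficient
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]
        a = a_new
        E *= (1.0 - k * k)
    return a[1:]
```
  • For example, levinson_durbin(autocorrelation(frame, 10), 10) yields a 10th-order coefficient set for a single frame.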
  • the processing module is configured to calculate the lifting factor according to the above formulas (10) - (12).
  • the synthesizing module is configured to modify the first linear prediction coefficient by using the above formula (13) to obtain the second linear prediction coefficient.
  • According to an embodiment of the present invention, the speech enhancement apparatus 200 further includes a filtering module 240, configured to linearly filter the voiced frame signal according to the second linear prediction coefficients.
  • According to an embodiment of the present invention, the voice enhancement device 200 further includes a pre-emphasis module 250, configured to pre-emphasize the voiced frame signal according to the foregoing formula (16) before the acquisition module acquires the M first linear prediction coefficients of the voiced frame signal.
  • The acquisition module may be configured to determine whether a speech frame is a voiced frame; only when the speech frame is a voiced frame is it processed according to the method of the embodiment of the present invention, and when the speech frame is an unvoiced frame it is output directly, which saves processing resources and improves processing efficiency.
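  • The patent does not specify how the voiced/unvoiced decision is made; the sketch below uses a simple energy and zero-crossing-rate rule purely as an assumed placeholder, and it ties together the helper sketches given earlier in this description. The lifting-factor computation (formulas (10)-(12)) and the coefficient modification (formula (13)) are not reproduced in this text, so they are left as caller-supplied functions.
```python
import numpy as np

def is_voiced(frame, energy_thresh=1e-3, zcr_thresh=0.25):
    """Assumed voiced/unvoiced rule (not the patent's): high frame energy
    together with a low zero-crossing rate is treated as voiced."""
    frame = np.asarray(frame, dtype=float)
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return energy > energy_thresh and zcr < zcr_thresh

def enhance_frame(frame, M, compute_lifting_factor, modify_coefficients):
    """Outline of the claimed per-frame processing, reusing pre_emphasize,
    autocorrelation, levinson_durbin and apply_modified_lpc from the
    sketches above. Unvoiced frames are passed through unchanged."""
    if not is_voiced(frame):
        return frame                                    # unvoiced: output directly
    x = pre_emphasize(frame)                            # pre-emphasis (formula (16))
    a_first = levinson_durbin(autocorrelation(x, M), M)
    beta = compute_lifting_factor(a_first)              # formulas (10)-(12), not shown here
    a_second = modify_coefficients(a_first, beta)       # formula (13), not shown here
    return apply_modified_lpc(x, a_first, a_second)     # linear filtering (formula (15))
```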
  • The speech enhancement device 200 can be implemented by using various hardware devices, such as a digital signal processing (DSP) chip, where the acquisition module 210, the processing module 220, the synthesizing module 230, and the filtering module 240 may each be implemented on a separate hardware device, or may be integrated into one hardware device.
  • FIG. 7 is a schematic hardware architecture 700 of a speech enhancement device 200 for implementing an embodiment of the present invention.
  • the hardware structure 700 includes a DSP chip 710, a memory 720, and an interface unit 730.
  • the DSP chip 710 can be used to implement the processing functions of the voice enhancement device 200 of the embodiment of the present invention, including the processing functions of the acquisition module 210, the processing module 220, the synthesis module 230, and the filtering module 240.
  • The memory 720 can be used to store the voiced frame signals to be processed, intermediate variables produced during the processing, the processed voiced frame signals, and the like.
  • The interface unit 730 can be used for data exchange with external devices.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or otherwise.
  • The components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium.
  • The technical solution of the present invention, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including a number of instructions for causing a computer device (which may be a personal computer, a server, or the like) to perform the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to a speech enhancement method and device. The speech enhancement method comprises: acquiring M first linear prediction coefficients of a voiced frame signal, M being the order of a linear prediction filter; acquiring a lifting factor, the lifting factor being obtained according to the correlation between the frequencies in the short-time spectral envelope corresponding to the M first linear prediction coefficients; and modifying the M first linear prediction coefficients according to the lifting factor and the correlation between the M first linear prediction coefficients, so that, compared with the first short-time spectral envelope corresponding to the M first linear prediction coefficients, the formant energy of the second short-time spectral envelope corresponding to the M second linear prediction coefficients obtained after the modification is enhanced and its medium and high frequency spectral components are compensated to some extent. Given the decisive effect of formant energy on voice quality and the contribution of the medium and high frequency spectral components of speech to sentence intelligibility, both the quality and the intelligibility of the speech are improved after processing by the method of the embodiments of the present invention.
PCT/CN2011/078087 2011-08-05 2011-08-05 Procédé et dispositif d'amélioration vocale WO2012159370A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2011/078087 WO2012159370A1 (fr) 2011-08-05 2011-08-05 Procédé et dispositif d'amélioration vocale
CN201180001446.0A CN103038825B (zh) 2011-08-05 2011-08-05 语音增强方法和设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/078087 WO2012159370A1 (fr) 2011-08-05 2011-08-05 Procédé et dispositif d'amélioration vocale

Publications (1)

Publication Number Publication Date
WO2012159370A1 true WO2012159370A1 (fr) 2012-11-29

Family

ID=47216591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/078087 WO2012159370A1 (fr) 2011-08-05 2011-08-05 Procédé et dispositif d'amélioration vocale

Country Status (2)

Country Link
CN (1) CN103038825B (fr)
WO (1) WO2012159370A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI555010B (zh) * 2013-12-16 2016-10-21 三星電子股份有限公司 音訊編碼方法及裝置、音訊解碼方法以及非暫時性電腦可讀記錄媒體

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3062945B1 (fr) * 2017-02-13 2019-04-05 Centre National De La Recherche Scientifique Methode et appareil de modification dynamique du timbre de la voix par decalage en frequence des formants d'une enveloppe spectrale
CN106856623B (zh) * 2017-02-20 2020-02-11 鲁睿 基带语音信号通讯噪声抑制方法及系统
CN109147806B (zh) * 2018-06-05 2021-11-12 安克创新科技股份有限公司 基于深度学习的语音音质增强方法、装置和系统
CN110797039B (zh) * 2019-08-15 2023-10-24 腾讯科技(深圳)有限公司 语音处理方法、装置、终端及介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1619646A (zh) * 2003-11-21 2005-05-25 三星电子株式会社 使用共振峰增强对话的方法和装置
US20100063808A1 (en) * 2008-09-06 2010-03-11 Yang Gao Spectral Envelope Coding of Energy Attack Signal
CN102044250A (zh) * 2009-10-23 2011-05-04 华为技术有限公司 频带扩展方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1619646A (zh) * 2003-11-21 2005-05-25 三星电子株式会社 使用共振峰增强对话的方法和装置
US20100063808A1 (en) * 2008-09-06 2010-03-11 Yang Gao Spectral Envelope Coding of Energy Attack Signal
CN102044250A (zh) * 2009-10-23 2011-05-04 华为技术有限公司 频带扩展方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI555010B (zh) * 2013-12-16 2016-10-21 三星電子股份有限公司 音訊編碼方法及裝置、音訊解碼方法以及非暫時性電腦可讀記錄媒體

Also Published As

Publication number Publication date
CN103038825B (zh) 2014-04-30
CN103038825A (zh) 2013-04-10

Similar Documents

Publication Publication Date Title
Li et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement
JP6790048B2 (ja) 時間領域デコーダにおける量子化雑音を低減するためのデバイスおよび方法
JP6374028B2 (ja) 音声プロファイルの管理および発話信号の生成
US9218820B2 (en) Audio fingerprint differences for end-to-end quality of experience measurement
US11605394B2 (en) Speech signal cascade processing method, terminal, and computer-readable storage medium
JP7297368B2 (ja) 周波数帯域拡張方法、装置、電子デバイスおよびコンピュータプログラム
WO2021147237A1 (fr) Procédé et appareil de traitement de signal vocal, et dispositif électronique et support de stockage
WO2012159370A1 (fr) Procédé et dispositif d'amélioration vocale
WO2010066158A1 (fr) Procédés et appareils de codage et de décodage de signal et système de codage et de décodage
WO2013060223A1 (fr) Procédé et appareil de compensation de perte de trames pour signal à trames de parole
US20100106269A1 (en) Method and apparatus for signal processing using transform-domain log-companding
TW200828268A (en) Dual-transform coding of audio signals
WO2021052285A1 (fr) Appareil et procédé d'extension de bande de fréquence, dispositif électronique et support de stockage lisible par ordinateur
JP5326465B2 (ja) オーディオ復号方法、装置、及びプログラム
KR101924767B1 (ko) 음성 주파수 코드 스트림 디코딩 방법 및 디바이스
WO2005106850A1 (fr) Appareil de codage de hiérarchie et procédé de codage de hiérarchie
WO2023197809A1 (fr) Procédé de codage et de décodage de signal audio haute fréquence et appareils associés
EP2774148A1 (fr) Extension de largeur de bande de signaux audio
US20110137644A1 (en) Decoding speech signals
JP6573887B2 (ja) オーディオ信号の符号化方法、復号方法及びその装置
WO2010103854A2 (fr) Dispositif et procédé de codage de paroles, et dispositif et procédé de décodage de paroles
Vicente-Peña et al. Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition
CN112530446A (zh) 频带扩展方法、装置、电子设备及计算机可读存储介质
TW202103146A (zh) 語音編碼方法與電子裝置
CN112562710B (zh) 一种基于深度学习的阶梯式语音增强方法

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201180001446.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11866050

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11866050

Country of ref document: EP

Kind code of ref document: A1