WO2022228144A1 - Audio signal enhancement method, apparatus, computer device, storage medium and computer program product - Google Patents

Audio signal enhancement method, apparatus, computer device, storage medium and computer program product

Info

Publication number
WO2022228144A1
Authority
WO
WIPO (PCT)
Prior art keywords
filtering
excitation signal
long
signal
linear
Application number
PCT/CN2022/086960
Other languages
English (en)
French (fr)
Inventor
王蒙
黄庆博
肖玮
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2023535590A (published as JP2023553629A)
Priority to EP22794615.9A (published as EP4297025A1)
Publication of WO2022228144A1
Priority to US18/076,116 (published as US20230099343A1)

Classifications

    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction; Coding or decoding of speech or audio signals using predictive techniques
    • G10L21/0232 Noise filtering characterised by the method used for estimating noise: processing in the frequency domain
    • G10L21/0364 Speech enhancement by changing the amplitude for improving intelligibility
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09 Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
    • G10L19/26 Pre-filtering or post-filtering
    • G10L2019/0011 Long term prediction filters, i.e. pitch estimation
    • G10L25/90 Pitch determination of speech signals

Definitions

  • the present application relates to the field of computer technology, and in particular, to an audio signal enhancement method, apparatus, computer equipment, storage medium and computer program product.
  • In speech encoding and decoding, quantization noise is usually introduced, which distorts the decoded and synthesized speech.
  • In conventional schemes, a pitch filter or neural-network-based post-processing technology is usually used to enhance the audio signal, so as to reduce the influence of quantization noise on speech quality.
  • Provided are an audio signal enhancement method, apparatus, computer device, storage medium and computer program product for improving audio signal quality.
  • An audio signal enhancement method performed by computer equipment, the method comprising:
  • sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filtering the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement processing on the filter speech excitation signal according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and
  • performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; the linear prediction filter is configured with parameters based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter includes:
  • performing energy adjustment on the historical long-term filtered excitation signal corresponding to the historical voice packet through the energy adjustment parameter, so as to obtain an adjusted historical long-term filtered excitation signal.
  • An audio signal enhancement device, the device comprising:
  • a voice packet processing module, configured to sequentially decode received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and to filter the residual signal to obtain an audio signal;
  • a feature parameter extraction module, configured to extract characteristic parameters from the audio signal when the audio signal is a forward error correction (FEC) frame signal;
  • a signal conversion module, configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters;
  • a speech enhancement module configured to perform speech enhancement processing on the filter speech excitation signal according to the characteristic parameter, the long-term filter parameter and the linear filter parameter to obtain an enhanced speech excitation signal;
  • a speech synthesis module configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
  • sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filtering the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
  • converting the audio signal into a filter speech excitation signal based on the linear filtering parameters, and performing speech enhancement processing on the filter speech excitation signal according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and
  • performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • A computer-readable storage medium storing a computer program that, when executed by a processor, implements the following steps:
  • sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filtering the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and
  • performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • a computer program comprising computer instructions stored in a computer-readable storage medium from which a processor of a computer device reads the computer instructions, the processor Executing the computer instructions causes the computer device to perform the following steps:
  • sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filtering the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
  • converting the audio signal into a filter speech excitation signal based on the linear filtering parameters, and performing speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and
  • performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • FIG. 1 is a schematic diagram of a speech generation model based on an excitation signal in one embodiment
  • FIG. 2 is an application environment diagram of the audio signal enhancement method in one embodiment
  • FIG. 3 is a schematic flowchart of an audio signal enhancement method in one embodiment
  • FIG. 4 is a schematic diagram of an audio signal transmission process flow in one embodiment
  • FIG. 5 is an amplitude-frequency response diagram of a long-term prediction filter in one embodiment
  • FIG. 6 is a schematic flowchart of a voice packet decoding and filtering step in one embodiment
  • FIG. 7 is an amplitude-frequency response diagram of a long-time inverse filter in one embodiment
  • FIG. 8 is a schematic diagram of a signal enhancement model in one embodiment
  • FIG. 9 is a schematic flowchart of an audio signal enhancement method in another embodiment
  • FIG. 10 is a schematic flowchart of an audio signal enhancement method in another embodiment
  • FIG. 11 is a structural block diagram of an audio signal enhancement apparatus in one embodiment
  • FIG. 12 is a structural block diagram of an audio signal enhancement apparatus in another embodiment
  • FIG. 13 is an internal structure diagram of a computer device in one embodiment
  • FIG. 14 is an internal structure diagram of a computer apparatus in another embodiment.
  • As shown in FIG. 1, in the speech generation model based on the excitation signal, the excitation signal impacts the human vocal cords, producing a quasi-periodic opening and closing; after amplification by the oral cavity, a sound is emitted, and the process that shapes the emitted sound corresponds to the filter in the speech generation model based on the excitation signal.
  • the filters in the speech generation model based on the excitation signal are subdivided into Long Term Prediction (LTP) filters and Linear Predictive Coding (LPC) filters.
  • The LTP filter uses the long-term correlation of speech to enhance the audio signal, while the LPC filter uses the short-term correlation of speech to enhance the audio signal.
  • For quasi-periodic signals such as voiced sounds, the excitation signal impacts the LTP filter and the LPC filter separately; for aperiodic signals such as unvoiced sounds, the excitation signal only impacts the LPC filter.
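To make this source-filter structure concrete, here is a minimal Python sketch of an excitation signal passing through an LTP-style and an LPC-style synthesis stage. The function name and the use of NumPy are illustrative, not from the patent:

```python
import numpy as np

def speech_generation(excitation, ltp_gain, pitch_T, lpc_coeffs, voiced=True):
    """Toy source-filter synthesis: the LTP stage restores pitch periodicity
    (voiced sounds only); the LPC stage restores short-term correlation."""
    e = np.array(excitation, dtype=float)
    if voiced:  # quasi-periodic signals pass through the LTP filter
        for n in range(pitch_T, len(e)):
            e[n] += ltp_gain * e[n - pitch_T]
    s = np.zeros_like(e)  # LPC synthesis feeds back previously generated samples
    for n in range(len(s)):
        s[n] = e[n] + sum(a * s[n - i]
                          for i, a in enumerate(lpc_coeffs, start=1) if n - i >= 0)
    return s
```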
  • The solutions provided by the embodiments of the present application relate to technologies such as artificial intelligence and machine learning, and are specifically described by the following embodiments:
  • The audio signal enhancement method provided by the present application is executed by computer equipment and can be applied to the application environment shown in FIG. 2.
  • The terminal 202 communicates with the server 204 through a network. The terminal 202 can receive voice packets sent by the server 204, or voice packets forwarded by other devices through the server 204; the server 204 can receive voice packets sent by the terminal 202 or by other devices.
  • The above audio signal enhancement method can be applied to the terminal 202 or the server 204; the terminal 202 is taken as an example for description.
  • The terminal 202 sequentially decodes the received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filters the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, it extracts characteristic parameters from the audio signal; based on the linear filtering parameters, it converts the audio signal into a filter speech excitation signal; according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters, it performs speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and it performs speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • The terminal 202 can be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device. The server 204 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • In one embodiment, an audio signal enhancement method is provided; taking the method as applied to the computer equipment (terminal or server) in FIG. 2 as an example for illustration, it includes the following steps:
  • S302 Decode the received voice packets in sequence to obtain a residual signal, long-term filtering parameters and linear filtering parameters; filter the residual signal to obtain an audio signal.
  • The received voice packets may be voice packets in an anti-packet-loss scenario based on forward error correction (Forward Error Correction, FEC) technology.
  • Forward error correction is an error control method: before the signal is sent into the transmission channel, it is encoded in advance according to a certain algorithm, adding redundant codes that carry the characteristics of the signal itself; the receiving end decodes the received signal according to the corresponding algorithm, so as to find and correct error codes generated during transmission.
  • Redundancy may also be referred to as redundant information.
  • Specifically, when the signal transmitting end encodes the audio signal of the current voice frame (referred to as the current frame for short), the audio signal information of the previous frame is also encoded into the voice packet corresponding to the current-frame audio signal as redundant information; after encoding is completed, the voice packet corresponding to the current frame is sent to the receiving end.
  • When the voice packet of a frame is lost or erroneous, the receiving end can decode the voice packet corresponding to the audio signal of the next voice frame (referred to as the next frame for short), thereby obtaining the audio signal corresponding to the lost or erroneous voice packet from the redundant information, which improves the reliability of signal transmission.
  • the receiving end may be the terminal 202 in FIG. 2 .
  • Specifically, when the terminal receives voice packets, it stores them in a cache, then takes the voice packet corresponding to the voice frame to be played out of the cache, and decodes and filters that voice packet to obtain the audio signal.
  • When the voice packet is an adjacent packet of the historical voice packet decoded at the previous moment and that historical voice packet is not abnormal, the obtained audio signal is output directly, or audio signal enhancement processing is performed on it to obtain a speech enhancement signal, which is then output.
  • When the voice packet is not an adjacent packet of the historical voice packet decoded at the previous moment, or the historical voice packet decoded at the previous moment is abnormal, audio signal enhancement processing is performed on the audio signal to obtain a speech enhancement signal, and the speech enhancement signal is output; the speech enhancement signal carries the audio signal corresponding to the adjacent packet of the historical voice packet decoded at the previous moment.
  • The decoding can be entropy decoding, which is the decoding scheme corresponding to entropy encoding. Specifically, when the transmitting end encodes the audio signal, it can use an entropy encoding scheme to obtain the voice packet, so that when the receiving end receives the voice packet, it can use an entropy decoding scheme to decode it.
  • the terminal when receiving a voice packet, decodes the received voice packet to obtain a residual signal and filter parameters, and performs signal synthesis and filtering on the residual signal based on the filter parameters to obtain an audio signal.
  • the filter parameters include long-term filter parameters and linear filter parameters.
  • When encoding the audio signal of the current frame, the transmitting end obtains filter parameters by analyzing the audio signal of the previous frame and configures the filter based on them; it then performs analysis filtering on the current-frame audio signal through the configured filter to obtain the residual signal of the current frame, encodes the audio signal into a voice packet using the residual signal and the filter parameters obtained by the analysis, and sends the voice packet to the receiving end. After receiving the voice packet, the receiving end decodes it to obtain the residual signal and the filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain the audio signal.
  • the filter parameters include linear filtering parameters and long-term filtering parameters.
  • Specifically, the transmitting end obtains the linear filtering parameters and the long-term filtering parameters by analyzing the audio signal of the previous frame; it performs linear analysis filtering on the current-frame audio signal based on the linear filtering parameters to obtain a linear filtering excitation signal, then performs long-term analysis filtering on the linear filtering excitation signal based on the long-term filtering parameters to obtain the residual signal corresponding to the current-frame audio signal. The residual signal, the linear filtering parameters and the long-term filtering parameters obtained by the analysis are used to encode the current-frame audio signal into a voice packet, which is sent to the receiving end.
  • Performing linear analysis filtering on the current-frame audio signal based on the linear filtering parameters specifically includes: configure the linear prediction filter with parameters based on the linear filtering parameters, and perform linear analysis filtering on the audio signal through the parameter-configured linear prediction filter to obtain the linear filtering excitation signal, wherein the linear filtering parameters include linear filter coefficients and an energy gain value.
  • the linear filter coefficients can be recorded as LPC AR, and the energy gain value can be recorded as LPC gain.
  • the formula of the linear prediction filter is as follows:

    $$e(n) = s(n) - \sum_{i=1}^{p} a_i \, s_{adj}(n-i)$$

  • e(n) is the linear filtering excitation signal corresponding to the audio signal of the current frame
  • s(n) is the audio signal of the current frame
  • p is the number of sampling points contained in each frame of audio signal
  • a_i is the linear filter coefficient obtained by analyzing the audio signal of the previous frame
  • s_adj(n-i) is the energy-adjusted state of the previous-frame audio signal s(n-i) of the current-frame audio signal s(n); s_adj(n-i) can be obtained by the following formula:

    $$s_{adj}(n-i) = gain_{adj} \cdot s(n-i)$$

  • gain_adj is the energy adjustment parameter of the previous-frame audio signal s(n-i); gain_adj can be obtained by the following formula:

    $$gain_{adj} = \frac{gain(n)}{gain(n-i)}$$

  • gain(n) is the energy gain value corresponding to the audio signal of the current frame
  • gain(n-i) is the energy gain value corresponding to the audio signal of the previous frame.
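As a concrete illustration of the three formulas above, the following Python sketch computes the energy adjustment parameter and the linear filtering excitation signal. Function and argument names are hypothetical, and the handling of the frame boundary is simplified:

```python
import numpy as np

def lpc_analysis_filter(s, s_prev, a, gain_cur, gain_prev):
    """e(n) = s(n) - sum_{i=1..p} a_i * s_adj(n-i); samples before the frame
    start are taken from the energy-adjusted previous frame."""
    gain_adj = gain_cur / gain_prev  # energy adjustment parameter gain_adj
    p = len(a)
    # last p energy-adjusted samples of the previous frame, then the current frame
    buf = np.concatenate([gain_adj * np.asarray(s_prev[-p:], dtype=float),
                          np.asarray(s, dtype=float)])
    e = np.zeros(len(s))
    for n in range(len(s)):
        e[n] = buf[p + n] - sum(a[i - 1] * buf[p + n - i] for i in range(1, p + 1))
    return e
```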
  • The long-term analysis filtering of the linear filtering excitation signal based on the long-term filtering parameters specifically includes: configure the long-term prediction filter with parameters based on the long-term filtering parameters, and perform long-term analysis filtering on the linear filtering excitation signal through the parameter-configured long-term prediction filter to obtain the residual signal corresponding to the audio signal of the current frame, wherein the long-term filtering parameters include the pitch period, which can be recorded as LTP pitch, and the corresponding amplitude gain value, which can be recorded as LTP gain.
  • the frequency domain representation of the long-term prediction filter is as follows, where the frequency domain can be denoted as the Z domain:

    $$P(z) = 1 - \gamma z^{-T} \qquad (1)$$

  • P(z) is the amplitude-frequency response of the long-term prediction filter
  • z is the twiddle factor of the frequency domain transformation
  • γ is the amplitude gain value LTP gain
  • T is the pitch period LTP pitch
  • the time domain representation of the long-term prediction filter is as follows:

    $$\delta(n) = e(n) - \gamma \, e(n-T)$$

  • δ(n) is the residual signal corresponding to the audio signal of the current frame
  • e(n) is the linear filtering excitation signal corresponding to the audio signal of the current frame
  • γ is the amplitude gain value LTP gain
  • T is the pitch period LTP pitch
  • e(n-T) is the linear filtering excitation signal corresponding to the audio signal one pitch period before the current frame.
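A minimal sketch of this long-term analysis filtering; the `e_hist` argument (excitation history from before the current frame) is a hypothetical convenience for supplying e(n-T) near the frame start:

```python
import numpy as np

def ltp_analysis_filter(e, ltp_gain, pitch_T, e_hist):
    """delta(n) = e(n) - gamma * e(n - T): remove the pitch-periodic component."""
    delta = np.array(e, dtype=float)
    for n in range(len(delta)):
        # a negative index reaches into the history buffer preceding the frame
        past = e[n - pitch_T] if n - pitch_T >= 0 else e_hist[n - pitch_T]
        delta[n] -= ltp_gain * past
    return delta
```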
  • the filter parameters decoded by the terminal include long-term filtering parameters and linear filtering parameters
  • the signal synthesis filtering includes long-term synthesis filtering based on the long-term filtering parameters, and linear synthesis filtering based on the linear filtering parameters.
  • After obtaining the residual signal, the terminal divides it into multiple subframes to obtain multiple sub-residual signals, performs long-term synthesis filtering on each sub-residual signal to obtain the long-term filtered excitation signal corresponding to each subframe, and then combines the long-term filtered excitation signals according to the timing of the subframes to obtain the long-term filtered excitation signal corresponding to the voice packet.
  • For example, for a 20ms residual signal, the residual signal can be divided into 4 subframes to obtain four 5ms sub-residual signals. Each sub-residual signal is subjected to long-term synthesis filtering based on the corresponding long-term filtering parameters to obtain four 5ms long-term filtered excitation signals, which are then combined according to the timing of the subframes to obtain a 20ms long-term filtered excitation signal.
  • Similarly, the terminal divides the obtained long-term filtered excitation signal into multiple subframes to obtain multiple sub-long-term filtered excitation signals, performs linear synthesis filtering on each sub-long-term filtered excitation signal based on the corresponding linear filtering parameters to obtain the sub-audio signal corresponding to each subframe, and then combines the sub-audio signals according to the timing of the subframes to obtain the corresponding audio signal.
  • For example, a 20ms long-term filtered excitation signal can be divided into two subframes to obtain two 10ms sub-long-term filtered excitation signals. Each 10ms sub-long-term filtered excitation signal is linearly synthesized and filtered based on the corresponding linear filtering parameters to obtain two 10ms sub-audio signals, which are then combined according to the timing of the subframes to obtain a 20ms audio signal.
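The subframe handling described above amounts to an even split followed by a timing-ordered concatenation. A small sketch; the 320-sample frame length assumes a 16 kHz sampling rate, which the patent does not specify:

```python
import numpy as np

def split_subframes(frame, num_subframes):
    """Divide one frame into equal subframes, e.g. a 20 ms residual into four 5 ms parts."""
    return np.split(np.asarray(frame), num_subframes)

def merge_subframes(subframes):
    """Recombine per-subframe outputs according to their timing."""
    return np.concatenate(subframes)

residual = np.random.randn(320)      # 20 ms frame at an assumed 16 kHz rate
subs = split_subframes(residual, 4)  # four 5 ms sub-residual signals
restored = merge_subframes(subs)     # back to one 20 ms signal
```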
  • The audio signal being a forward error correction frame signal means that the audio signal of a historically adjacent frame of this audio signal is abnormal; the abnormality specifically includes that the audio signal of the historically adjacent frame is not received, or is received with errors.
  • the characteristic parameters include cepstral characteristic parameters.
  • Specifically, the terminal determines whether the historical voice packet decoded before the current voice packet has a data abnormality; if so, it determines that the currently decoded and filtered audio signal is a forward error correction frame signal.
  • Specifically, the terminal determines whether the historical audio signal corresponding to the historical voice packet decoded at the previous moment is the previous-frame audio signal of the audio signal obtained by decoding the current voice packet; if so, the historical voice packet has no data abnormality; if not, the historical voice packet has a data abnormality.
  • In this embodiment, by determining whether the historical voice packet decoded before the current voice packet has a data abnormality, the terminal determines whether the currently decoded and filtered audio signal is a forward error correction frame signal, so that audio signal enhancement processing can be performed when it is, further improving audio signal quality.
  • When the audio signal is a forward error correction frame signal, feature parameters are extracted from the decoded audio signal; the extracted feature parameter may specifically be a cepstral feature parameter, and the extraction specifically includes the following steps: perform Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; and perform inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters.
  • the cepstral feature parameters can be extracted from the audio signal by the following formula:

    $$C(n) = \mathcal{F}^{-1}\big(\log \lvert S(F) \rvert\big), \qquad S(F) = \mathcal{F}\big(S(n)\big)$$

  • C(n) is the cepstral feature parameter of the audio signal S(n) obtained after decoding and filtering
  • S(F) is the Fourier-transformed audio signal obtained by performing Fourier transform on the audio signal S(n).
  • In this embodiment, by extracting the cepstral feature parameters from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameters, thereby improving audio signal quality.
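A minimal NumPy sketch of this extraction; the FFT size is illustrative, and keeping 40 coefficients matches the 40-dimensional cepstral input described in the model embodiment later:

```python
import numpy as np

def cepstral_features(s, n_fft=512, n_ceps=40):
    """Fourier transform -> log magnitude -> inverse Fourier transform."""
    spec = np.fft.rfft(s, n=n_fft)          # S(F) = FFT(S(n))
    log_mag = np.log(np.abs(spec) + 1e-9)   # logarithm; epsilon avoids log(0)
    ceps = np.fft.irfft(log_mag, n=n_fft)   # C(n) = IFFT(log|S(F)|)
    return ceps[:n_ceps]                    # keep the first 40 coefficients
```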
  • Feature parameters can also be extracted from the currently decoded and filtered audio signal, so as to perform audio signal enhancement processing on the currently decoded and filtered audio signal.
  • Specifically, the terminal can obtain the linear filtering parameters obtained when decoding the voice packet, and perform linear analysis filtering on the obtained audio signal based on these linear filtering parameters, so as to convert the audio signal into a filter speech excitation signal.
  • S306 specifically includes the following steps: performing parameter configuration on the linear prediction filter based on the linear filtering parameters, and linearly decomposing and filtering the audio signal by using the linear prediction filter configured by the parameters to obtain the filter speech excitation signal.
  • the linear decomposition filtering is also called linear analysis filtering.
  • When linear analysis filtering is performed on the audio signal, it is performed directly on the entire frame of the audio signal; there is no need to divide the whole-frame audio signal into subframes.
  • the terminal can use the following formula to linearly decompose and filter the audio signal to obtain the filter speech excitation signal:

    $$D(n) = S(n) - \sum_{i=1}^{p} a_i \, S_{adj}(n-i)$$

  • D(n) is the filter speech excitation signal corresponding to the audio signal S(n) obtained after decoding and filtering the voice packet
  • S(n) is the audio signal obtained after decoding and filtering the voice packet
  • S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the obtained audio signal S(n)
  • p is the number of sampling points included in each frame of audio signal
  • a_i is the linear filter coefficient obtained by decoding the voice packet.
  • the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, thereby improving the quality of the audio signal.
  • the long-term filtering parameters include pitch period and amplitude gain value.
  • S308 includes the following steps: performing speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filter parameter and the cepstral characteristic parameter to obtain an enhanced speech excitation signal.
  • The speech enhancement processing of the audio signal can be implemented by a pre-trained signal enhancement model. The signal enhancement model is a neural network (Neural Network, NN) model, which can specifically adopt a structure of LSTM and CNN.
  • The terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters and the cepstral feature parameters to obtain the enhanced speech excitation signal; enhancement of the audio signal can then be realized based on the enhanced speech excitation signal, improving audio signal quality.
  • Specifically, the terminal inputs the obtained feature parameters, long-term filtering parameters, linear filtering parameters and the filter speech excitation signal into a pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on these parameters to obtain the enhanced speech excitation signal.
  • In this embodiment, the terminal obtains the enhanced speech excitation signal through the pre-trained signal enhancement model and can then enhance the audio signal based on it, which improves both the quality of the audio signal and the efficiency of the audio signal enhancement processing.
  • The speech enhancement processing is performed on the entire frame of the filter speech excitation signal; there is no need to divide the whole frame of the filter speech excitation signal into subframes.
  • S310 Perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
  • the speech synthesis may be linear synthesis filtering based on linear filtering parameters.
  • the terminal after obtaining the enhanced speech excitation signal, performs parameter configuration on the linear prediction filter based on the linear filtering parameters, and performs linear synthesis filtering on the enhanced speech excitation signal by using the parameter-configured linear prediction filter, to obtain Speech enhancement signal.
  • the linear filtering parameters include linear filtering coefficients and energy gain values
  • the linear filtering coefficients can be recorded as LPC AR
  • the energy gain value can be recorded as LPC gain
  • the linear synthesis filtering is the inverse process of the linear analysis filtering performed by the transmitting end when encoding the audio signal; therefore, the linear prediction filter that performs linear synthesis filtering is also called the linear inverse filter, and its time domain representation is as follows:

    $$S_{enh}(n) = D_{enh}(n) + \sum_{i=1}^{p} a_i \, S_{adj}(n-i)$$

  • S_enh(n) is the speech enhancement signal
  • D_enh(n) is the enhanced speech excitation signal obtained by performing speech enhancement processing on the filter speech excitation signal D(n)
  • S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the obtained audio signal S(n)
  • p is the number of sampling points contained in each frame of audio signal
  • a_i is the linear filter coefficient obtained by decoding the voice packet
  • S_adj(n-i) can be obtained by the following formula:

    $$S_{adj}(n-i) = gain_{adj} \cdot S(n-i)$$

  • gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
  • the terminal can obtain a speech enhancement signal by performing linear synthesis filtering on the enhanced speech excitation signal, that is, the enhancement processing of the audio signal is realized, and the quality of the audio signal is improved.
  • The speech synthesis is performed on the entire frame of the enhanced speech excitation signal; there is no need to divide the whole frame of the enhanced speech excitation signal into subframes.
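A sketch of this linear synthesis (inverse) filtering, mirroring the analysis sketch given earlier; names are illustrative and `s_prev_adj` is assumed to already be energy-adjusted:

```python
import numpy as np

def lpc_synthesis_filter(d_enh, s_prev_adj, a):
    """S_enh(n) = D_enh(n) + sum_{i=1..p} a_i * S_adj(n-i); history comes from
    the energy-adjusted previous frame, in-frame terms from samples already
    synthesized within the current frame."""
    p = len(a)
    buf = np.concatenate([np.asarray(s_prev_adj[-p:], dtype=float),
                          np.zeros(len(d_enh))])
    for n in range(len(d_enh)):
        buf[p + n] = d_enh[n] + sum(a[i - 1] * buf[p + n - i] for i in range(1, p + 1))
    return buf[p:]
```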
  • In the above embodiment, when the terminal receives a voice packet, it sequentially decodes and filters the voice packet to obtain an audio signal. When the audio signal is a forward error correction frame signal, the terminal extracts characteristic parameters from the audio signal, converts the audio signal into a filter speech excitation signal based on the linear filter coefficients obtained by decoding the voice packet, performs speech enhancement processing on the filter speech excitation signal according to the characteristic parameters and the long-term filtering parameters obtained by decoding the voice packet to obtain the enhanced speech excitation signal, and performs speech synthesis to obtain the speech enhancement signal. In this way, enhancement processing of the audio signal can be completed in less time while achieving a better signal enhancement effect, which improves the timeliness of audio signal enhancement.
  • S302 specifically includes the following steps:
  • S602 Perform parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the long-term prediction filter configured by the parameters to obtain a long-term filtering excitation signal.
  • the long-term filtering parameters include the pitch period and the corresponding amplitude gain value.
  • the pitch period can be recorded as LTP pitch.
  • the corresponding amplitude gain value can be recorded as LTP gain.
  • The long-term prediction filter performs long-term synthesis filtering on the residual signal. The long-term synthesis filtering is the inverse process of the long-term analysis filtering performed when the transmitting end encodes the audio signal, so the long-term prediction filter that performs long-term synthesis filtering is also called a long-term inverse filter; that is, a long-term inverse filter is used to process the residual signal.
  • the frequency domain representation of the long-term inverse filter corresponding to formula (1) is as follows:
  • p -1 (z) is the amplitude-frequency response of the long-time inverse filter
  • z is the rotation factor of the frequency domain transformation
  • is the amplitude gain value LTP gain
  • T is the pitch period LTP pitch.
  • the time domain representation of the long-term inverse filter is as follows:

    $$E(n) = \delta(n) + \gamma \, E(n-T)$$

  • E(n) is the long-term filtered excitation signal corresponding to the voice packet
  • δ(n) is the residual signal corresponding to the voice packet
  • γ is the amplitude gain value LTP gain
  • T is the pitch period LTP pitch
  • E(n-T) is the long-term filtered excitation signal corresponding to the audio signal one pitch period earlier.
  • It should be noted that the long-term filtered excitation signal E(n), obtained at the receiving end by performing long-term synthesis filtering on the residual signal through the long-term inverse filter, is the same as the linear filtering excitation signal e(n) obtained at the transmitting end by performing linear analysis filtering on the audio signal through the linear filter during encoding.
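A sketch of the long-term synthesis (inverse) filtering; `E_hist` (excitation history from the previous frame) is a hypothetical argument:

```python
import numpy as np

def ltp_synthesis_filter(delta, ltp_gain, pitch_T, E_hist):
    """E(n) = delta(n) + gamma * E(n - T): restore the pitch-periodic component
    removed by the encoder's long-term analysis filtering."""
    E = np.array(delta, dtype=float)
    for n in range(len(E)):
        past = E[n - pitch_T] if n - pitch_T >= 0 else E_hist[n - pitch_T]
        E[n] += ltp_gain * past
    return E
```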
  • S604 perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the linear prediction filter after the parameter configuration, to obtain an audio signal.
  • the linear filtering parameters include linear filtering coefficients and energy gain values
  • the linear filtering coefficients can be recorded as LPC AR
  • the energy gain value can be recorded as LPC gain
  • the linear synthesis filtering is the inverse process of the linear analysis filtering performed by the transmitting end when encoding the audio signal; therefore, the linear prediction filter that performs linear synthesis filtering is also called the linear inverse filter, and its time domain representation is as follows:

    $$S(n) = E(n) + \sum_{i=1}^{p} a_i \, S_{adj}(n-i) \qquad (12)$$

  • S(n) is the audio signal corresponding to the voice packet
  • E(n) is the long-term filtered excitation signal corresponding to the voice packet
  • S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the audio signal S(n); S_adj(n-i) can be obtained by the following formula:

    $$S_{adj}(n-i) = gain_{adj} \cdot S(n-i), \qquad gain_{adj} = \frac{gain(n)}{gain(n-i)} \qquad (13)$$

  • p is the number of sampling points included in each frame of audio signal
  • a_i is the linear filter coefficient obtained by decoding the voice packet
  • gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
  • gain(n) is the energy gain value obtained by decoding the voice packet
  • gain(n-i) is the energy gain value corresponding to the audio signal of the previous frame.
  • In this embodiment, the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtered excitation signal, and performs linear synthesis filtering on the long-term filtered excitation signal based on the decoded linear filtering parameters to obtain the audio signal. Therefore, when the audio signal is not a forward error correction frame signal, it can be output directly; when it is a forward error correction frame signal, it is enhanced and then output, which improves the timeliness of audio signal output.
  • In one embodiment, S604 specifically includes the following steps: divide the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals; group the linear filtering parameters obtained by decoding to obtain at least two linear filtering parameter sets; configure at least two linear prediction filters with parameters based on the linear filtering parameter sets respectively; input the obtained sub-long-term filtered excitation signals into the parameter-configured linear prediction filters respectively, so that each linear prediction filter performs linear synthesis filtering on its sub-long-term filtered excitation signal based on the corresponding linear filtering parameter set to obtain the sub-audio signal of each subframe; and combine the sub-audio signals according to the timing of the subframes to obtain the audio signal.
  • the linear filter parameter set has two types: linear filter coefficient set and energy gain value set.
  • In this case, S(n) in formula (12) is the sub-audio signal corresponding to any subframe; E(n) is the sub-long-term filtered excitation signal corresponding to that subframe; S_adj(n-i) is the energy-adjusted state of the sub-audio signal S(n-i) of the previous subframe; p is the number of sampling points included in each subframe of the audio signal; and a_i is taken from the linear filter coefficient set corresponding to the subframe.
  • In formula (13), gain_adj is the energy adjustment parameter of the sub-audio signal of the previous subframe; gain(n) is the energy gain value of the current sub-audio signal; and gain(n-i) is the energy gain value of the sub-audio signal of the previous subframe.
  • In this embodiment, the terminal divides the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals, groups the decoded linear filtering parameters into at least two linear filtering parameter sets, configures at least two linear prediction filters based on these sets, and inputs the sub-long-term filtered excitation signals into the parameter-configured linear prediction filters, so that each filter performs linear synthesis filtering on its sub-long-term filtered excitation signal based on the corresponding linear filtering parameter set. This accurately restores the audio signal sent by the transmitting end and improves the quality of the restored audio signal.
  • In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value, and S604 further includes the following steps: for the sub-long-term filtered excitation signal corresponding to the first subframe of the long-term filtered excitation signal, obtain the energy gain value of the historical sub-long-term filtered excitation signal of the adjacent subframe in the historical long-term filtered excitation signal; determine the energy adjustment parameter of the sub-long-term filtered excitation signal based on the energy gain value corresponding to the historical sub-long-term filtered excitation signal and the energy gain value of the sub-long-term filtered excitation signal corresponding to the first subframe; and perform energy adjustment on the historical sub-long-term filtered excitation signal through the energy adjustment parameter to obtain the energy-adjusted historical sub-long-term filtered excitation signal.
  • The historical long-term filtered excitation signal is the long-term filtered excitation signal of the frame preceding the current frame, and the historical sub-long-term filtered excitation signal of the subframe adjacent to the first subframe of the current frame is the sub-long-term filtered excitation signal corresponding to the last subframe of the previous frame's long-term filtered excitation signal.
  • For example, if the long-term filtered excitation signal of the current frame is divided into two subframes, yielding the sub-long-term filtered excitation signals corresponding to the first and second subframes, then the sub-long-term filtered excitation signal corresponding to the second subframe of the previous frame's long-term filtered excitation signal and the sub-long-term filtered excitation signal corresponding to the first subframe of the current frame are adjacent subframes.
  • Specifically, after obtaining the energy-adjusted historical sub-long-term filtered excitation signal, the terminal inputs the obtained sub-long-term filtered excitation signal and the energy-adjusted historical sub-long-term filtered excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtered excitation signal corresponding to the first subframe based on the linear filter coefficients and the energy-adjusted historical sub-long-term filtered excitation signal, obtaining the sub-audio signal corresponding to the first subframe.
  • For example, suppose a voice packet corresponds to a 20ms audio signal, that is, the obtained long-term filtered excitation signal is 20ms; the AR coefficients obtained by decoding the voice packet are {A_1, A_2, ..., A_{p-1}, A_p, A_{p+1}, ..., A_{2p-1}, A_{2p}}, and the energy gain values obtained by decoding the voice packet are {gain_1(n), gain_2(n)}. The long-term filtered excitation signal can be divided into two subframes, giving the first sub-filtered excitation signal E_1(n) for the first 10ms and the second sub-filtered excitation signal E_2(n) for the last 10ms. The AR coefficients are grouped into AR coefficient set 1 {A_1, A_2, ..., A_{p-1}, A_p} and AR coefficient set 2 {A_{p+1}, ..., A_{2p-1}, A_{2p}}, and the energy gain values are grouped into energy gain value set 1 {gain_1(n)} and energy gain value set 2 {gain_2(n)}. The energy gain value set of the subframe preceding the first sub-filtered excitation signal E_1(n) is {gain_2(n-i)}; the sub-filtered excitation signal of the subframe preceding the second sub-filtered excitation signal E_2(n) is E_1(n), whose energy gain value set is {gain_1(n)}. The sub-audio signal corresponding to E_1(n) and the sub-audio signal corresponding to E_2(n) can then be obtained by substituting the corresponding parameters into formula (12) and formula (13).
  • In this embodiment, for the sub-long-term filtered excitation signal corresponding to the first subframe of the long-term filtered excitation signal, the terminal obtains the historical sub-long-term filtered excitation signal of the adjacent subframe in the historical long-term filtered excitation signal, performs energy adjustment on it, and then performs linear synthesis filtering based on the adjusted signal, so that the sub-audio signal of the first subframe is accurately restored across the frame boundary, improving the quality of the restored audio signal.
  • the characteristic parameter includes a cepstral characteristic parameter
  • S308 includes the following steps: perform vectorization processing on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters, and splice the vectorized results to obtain a feature vector; input the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; perform feature extraction on the feature vector through the signal enhancement model to obtain a target feature vector; and enhance the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
  • the signal enhancement model is a multi-level network structure, which specifically includes a first feature splicing layer, a second feature splicing layer, a first neural network layer and a second neural network layer.
  • the target feature vector is the enhanced feature vector.
  • Specifically, the terminal performs vectorization processing on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters through the first feature splicing layer of the signal enhancement model, and splices the vectorized results to obtain a feature vector. The feature vector is input into the first neural network layer of the signal enhancement model, which performs feature extraction to obtain a primary feature vector. The primary feature vector and the envelope information, obtained by Fourier transforming the linear filter coefficients in the linear filtering parameters, are input into the second feature splicing layer of the signal enhancement model to obtain a spliced primary feature vector; the spliced primary feature vector is input into the second neural network layer of the signal enhancement model, which performs feature extraction on it to obtain the target feature vector. The filter speech excitation signal is then enhanced based on the target feature vector to obtain the enhanced speech excitation signal.
  • In this embodiment, the terminal obtains a feature vector by vectorizing and splicing the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters; inputs the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; performs feature extraction on the feature vector through the signal enhancement model to obtain the target feature vector; and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal. Performing the enhancement processing of the audio signal through the signal enhancement model improves both the quality of the audio signal and the efficiency of the enhancement processing.
  • In one embodiment, the terminal enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal as follows: perform Fourier transform on the filter speech excitation signal to obtain a frequency domain speech excitation signal; enhance the amplitude features of the frequency domain speech excitation signal based on the target feature vector; and perform inverse Fourier transform on the frequency domain speech excitation signal with enhanced amplitude features to obtain the enhanced speech excitation signal.
  • Specifically, after performing Fourier transform on the filter speech excitation signal to obtain the frequency domain speech excitation signal, the terminal enhances the amplitude features of the frequency domain speech excitation signal based on the target feature vector, and then, combined with the unenhanced phase features of the frequency domain speech excitation signal, performs inverse Fourier transform on the amplitude-enhanced frequency domain speech excitation signal to obtain the enhanced speech excitation signal.
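A sketch of this magnitude-only enhancement. Interpreting the target feature vector as per-frequency-bin gains is an assumption; the patent only states that the amplitude features are enhanced "based on" it. For a 320-sample frame, `np.fft.rfft` yields 161 bins, consistent with the 161-node output of NN part2 described below:

```python
import numpy as np

def enhance_excitation(d, bin_gains):
    """FFT -> scale magnitudes -> IFFT with the original (unenhanced) phase."""
    spec = np.fft.rfft(d)                    # frequency domain speech excitation
    mag, phase = np.abs(spec), np.angle(spec)
    mag_enh = mag * bin_gains                # amplitude features enhanced
    spec_enh = mag_enh * np.exp(1j * phase)  # phase kept unchanged
    return np.fft.irfft(spec_enh, n=len(d))  # enhanced excitation D_enh(n)
```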
  • the two feature splicing layers are concat1 and concat2 respectively, and the two neural network layers are NN part1 and NN part2 respectively.
  • The cepstral feature parameter Cepstrum of dimension 40, the pitch period LTP pitch of dimension 1 and the amplitude gain value LTP gain of dimension 1 are spliced by concat1 into a feature vector of dimension 42, and the 42-dimensional feature vector is input into NN part1, which consists of a two-layer convolutional neural network and a two-layer fully connected network.
  • the dimension of the convolution kernel of the first layer is (1, 128, 3, 1)
  • the dimension of the convolution kernel of the second layer is (128, 128, 3, 1)
  • the numbers of nodes of the two fully connected layers are 128 and 8
  • the activation function at the end of each layer is the Tanh function
  • High-level features are extracted from the feature vector through NN part1 to obtain a primary feature vector of dimension 1024. The 1024-dimensional primary feature vector is then spliced by concat2 with the envelope information Envelope of dimension 161, obtained by Fourier transforming the linear filter coefficients LPC AR, to obtain a spliced primary feature vector of dimension 1185, which is input into NN part2, a two-layer fully connected network with 256 and 161 nodes respectively.
  • the activation function at the end of each layer is the Tanh function.
  • The target feature vector is obtained through NN part2; then, based on the target feature vector, the amplitude feature Excitation of the frequency domain speech excitation signal obtained by Fourier transforming the filter speech excitation signal is enhanced, and inverse Fourier transform is performed on the amplitude-enhanced signal to obtain the enhanced speech excitation signal D_enh(n).
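A PyTorch sketch of this structure. The text leaves some wiring open; in particular, reading the 128- and 8-node fully connected layers as applied per convolution channel (128 x 8 = 1024) is one plausible way to obtain the 1024-dimensional primary feature vector, not the patent's definitive architecture:

```python
import torch
import torch.nn as nn

class SignalEnhancementModel(nn.Module):
    def __init__(self):
        super().__init__()
        # NN part1: two conv layers with kernels (1,128,3,1) and (128,128,3,1),
        # then fully connected layers with 128 and 8 nodes, Tanh after each layer
        self.conv = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.Tanh(),
        )
        self.fc = nn.Sequential(
            nn.Linear(42, 128), nn.Tanh(),  # applied per channel (assumption)
            nn.Linear(128, 8), nn.Tanh(),
        )
        # NN part2: two fully connected layers with 256 and 161 nodes, Tanh
        self.part2 = nn.Sequential(
            nn.Linear(1024 + 161, 256), nn.Tanh(),
            nn.Linear(256, 161), nn.Tanh(),
        )

    def forward(self, cepstrum, ltp_pitch, ltp_gain, envelope):
        # concat1: 40-dim cepstrum + 1-dim LTP pitch + 1-dim LTP gain -> 42 dims
        x = torch.cat([cepstrum, ltp_pitch, ltp_gain], dim=-1).unsqueeze(1)
        x = self.conv(x)        # (batch, 128, 42)
        x = self.fc(x)          # (batch, 128, 8): FC over the last dimension
        primary = x.flatten(1)  # 1024-dim primary feature vector
        # concat2: splice with the 161-dim LPC envelope, then NN part2
        return self.part2(torch.cat([primary, envelope], dim=-1))
```

Calling it with, e.g., `model(torch.randn(1, 40), torch.randn(1, 1), torch.randn(1, 1), torch.randn(1, 161))` yields a (1, 161) target feature vector, which can serve as the per-bin gains in the magnitude enhancement sketch above.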
  • the terminal obtains the frequency-domain speech excitation signal by performing Fourier transform on the filter speech excitation signal, enhances the amplitude features of the frequency-domain speech excitation signal based on the target feature vector, and performs inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal, so that the audio signal can be enhanced while the phase information of the audio signal is kept unchanged, improving the quality of the audio signal.
  • the linear filtering parameters include a linear filtering coefficient and an energy gain value; the step in which the terminal performs parameter configuration on the linear prediction filter based on the linear filtering parameters and performs linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter includes: performing parameter configuration on the linear prediction filter based on the linear filtering coefficient; obtaining the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded; determining the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet; performing energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter, to obtain the adjusted historical long-term filtering excitation signal; and inputting the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
  • the historical audio signal corresponding to the historical voice packet is the previous frame audio signal of the current frame audio signal corresponding to the current voice packet.
  • the energy gain value corresponding to the historical voice packet may be the energy gain value corresponding to the whole-frame audio signal of the historical voice packet, or the energy gain value corresponding to partial subframe audio signals of the historical voice packet.
  • when the audio signal is not a forward error correction frame signal, that is, when the previous frame audio signal of the current frame audio signal was obtained by the terminal normally decoding the historical voice packet, the energy gain value of the historical voice packet obtained when the terminal decoded the historical voice packet can be obtained, and the energy adjustment parameter is determined based on it; when the audio signal is a forward error correction frame, that is, when the previous frame audio signal of the current frame audio signal could not be obtained by the terminal normally decoding the historical voice packet, the compensated energy gain value corresponding to the previous frame audio signal is determined based on a preset energy gain compensation mechanism, the compensated energy gain value is taken as the energy gain value of the historical voice packet, and the energy adjustment parameter is determined based on that energy gain value.
  • the energy adjustment parameter gain_adj of the previous frame audio signal S(n-i) can be calculated by the following formula:
  • gain_adj = gain(n) / gain(n-i)        (14)
  • where gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i), gain(n-i) is the energy gain value of the previous frame audio signal S(n-i), and gain(n) is the energy gain value of the current frame audio signal
  • formula (14) calculates the energy adjustment parameter based on the energy gain value corresponding to the whole-frame audio signal of the historical voice packet.
  • the energy adjustment parameter gain_adj of the previous frame audio signal S(n-i) can also be obtained by the following formula:
  • gain_adj = [(gain_1(n) + … + gain_m(n)) / m] / gain_m(n-i)        (15)
  • where gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i)
  • gain_m(n-i) is the energy gain value of the m-th subframe of the previous frame audio signal S(n-i)
  • gain_m(n) is the energy gain value of the m-th subframe of the current frame audio signal
  • m is the number of subframes corresponding to each audio signal
  • (gain_1(n) + … + gain_m(n)) / m is the energy gain value of the current frame audio signal
  • formula (15) calculates the energy adjustment parameter based on the energy gain values corresponding to partial subframe audio signals of the historical voice packet; both variants are sketched below.
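A minimal sketch of the two energy-adjustment variants, using formulas (14) and (15) as reconstructed above; the ratio direction (current-frame gain over previous-frame gain) is an assumption consistent with scaling the previous frame toward the current frame's energy.

```python
def gain_adj_whole_frame(gain_prev, gain_cur):
    """Formula (14): energy adjustment from whole-frame energy gains."""
    return gain_cur / gain_prev

def gain_adj_subframes(gain_prev_last_sub, gain_cur_subs):
    """Formula (15): the current-frame gain is the mean of its subframe
    gains; the previous-frame gain is that of its last (m-th) subframe."""
    gain_cur = sum(gain_cur_subs) / len(gain_cur_subs)
    return gain_cur / gain_prev_last_sub
```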
  • the terminal performs parameter configuration on the linear prediction filter based on the linear filtering coefficient, obtains the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded, determines the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet, performs energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal, and inputs the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal, thereby smoothing the audio signals of different frames and improving the quality of the speech composed of the audio signals of different frames.
  • in another embodiment, as shown in FIG. 9, a method for enhancing an audio signal is provided; taking the method applied to the computer device (terminal or server) in FIG. 2 as an example, it includes the following steps:
  • S902 Decode the voice packet to obtain a residual signal, long-term filtering parameters and linear filtering parameters.
  • S904 Perform parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the parameter-configured long-term prediction filter, to obtain a long-term filtering excitation signal.
  • S906 Divide the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals.
  • S908 Group the linear filtering parameters obtained by decoding, to obtain at least two linear filtering parameter sets.
  • S910 Perform parameter configuration on at least two linear prediction filters respectively based on the linear filtering parameter sets.
  • S912 Input the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub-audio signals corresponding to each subframe.
  • S914 Combine the sub-audio signals according to the time sequence of each subframe to obtain an audio signal.
  • S916 Determine whether a data abnormality occurred in the historical voice packet decoded before the voice packet is decoded.
  • S918 If a data abnormality occurred in the historical voice packet, determine that the audio signal obtained through decoding and filtering is a forward error correction frame signal.
  • S920 When the audio signal is a forward error correction frame signal, perform Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; perform inverse Fourier transform on the logarithmic result to obtain cepstral feature parameters.
  • S922 Perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the parameter-configured linear prediction filter, to obtain a filter speech excitation signal.
  • S924 Input the feature parameters, the long-term filtering parameters, the linear filtering parameters, and the filter speech excitation signal into the pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters, to obtain an enhanced speech excitation signal.
  • S926 Perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter, to obtain a speech enhancement signal; the control flow of these steps is sketched below.
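The control flow of S902–S926 can be summarized in Python as below. Every helper on `decoder` and `model` is a hypothetical placeholder for the corresponding stage described in the text, not an API of any real codec.

```python
def decode_and_enhance(packet, decoder, model):
    residual, ltp, lpc = decoder.entropy_decode(packet)                    # S902
    e = decoder.ltp_synthesis(residual, ltp)                               # S904
    subframes = decoder.split_subframes(e)                                 # S906
    lpc_sets = decoder.group_lpc_params(lpc)                               # S908, S910
    subs = [decoder.lpc_synthesis(sf, ps)
            for sf, ps in zip(subframes, lpc_sets)]                        # S912
    audio = decoder.combine(subs)                                          # S914
    if not decoder.history_abnormal():                                     # S916, S918
        return audio                      # not an FEC frame: output directly
    cepstrum = decoder.cepstral_features(audio)                            # S920
    d = decoder.lpc_analysis(audio, lpc)                                   # S922
    d_enh = model.enhance(cepstrum, ltp, lpc, d)                           # S924
    return decoder.lpc_synthesis_frame(d_enh, lpc)                         # S926
```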
  • the present application also provides an application scenario where the above-mentioned audio signal enhancement method is applied.
  • the application of the audio signal enhancement method in this application scenario is as follows:
  • taking a wideband signal with Fs of 16000 Hz as an example (the method is also applicable to other sampling rates, such as Fs of 8000 Hz, 32000 Hz or 48000 Hz), with the frame length of the audio signal set to 20 ms, which for Fs = 16000 Hz corresponds to 320 sample points per frame: referring to FIG. 10, after receiving the voice packet corresponding to a frame of audio signal, the terminal performs entropy decoding on the voice packet to obtain δ(n), LTP pitch, LTP gain, LPC AR and LPC gain; performs LTP synthesis filtering on δ(n) based on LTP pitch and LTP gain to obtain E(n); performs LPC synthesis filtering on each subframe of E(n) based on LPC AR and LPC gain and combines the LPC synthesis filtering results into one frame S(n); then performs cepstral analysis on S(n) to obtain C(n), and performs LPC decomposition filtering on the whole frame S(n) based on LPC AR and LPC gain to obtain the whole frame D(n); inputs LTP pitch, LTP gain, the envelope information obtained by Fourier transform of LPC AR, C(n) and D(n) into the pretrained signal enhancement model NN postfilter; enhances the whole frame D(n) through the NN postfilter to obtain the whole frame D_enh(n); and performs LPC synthesis filtering on the whole frame D_enh(n) based on LPC AR and LPC gain to obtain S_enh(n).
  • although the steps in the flowcharts of FIGS. 3, 4, 6, 9 and 10 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 3, 4, 6, 9 and 10 may include multiple sub-steps or stages, which are not necessarily executed and completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
  • an audio signal enhancement apparatus is provided; the apparatus can use software modules or hardware modules, or a combination of the two, to become a part of computer equipment, and specifically includes: a voice packet processing module 1102, a feature parameter extraction module 1104, a signal conversion module 1106, a speech enhancement module 1108 and a speech synthesis module 1110, wherein:
  • the voice packet processing module 1102 is used for sequentially decoding the received voice packets to obtain residual signals, long-term filtering parameters and linear filtering parameters, and for filtering the residual signals to obtain audio signals.
  • the feature parameter extraction module 1104 is configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal.
  • the signal conversion module 1106 is configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters.
  • the speech enhancement module 1108 is configured to perform speech enhancement processing on the filter speech excitation signal according to the characteristic parameter, the long-term filter parameter and the linear filter parameter to obtain an enhanced speech excitation signal.
  • the speech synthesis module 1110 is configured to perform speech synthesis based on the enhanced speech excitation signal and linear filtering parameters to obtain a speech enhancement signal.
  • in the above embodiment, the computer equipment sequentially decodes the received voice packets to obtain the residual signal, the long-term filtering parameters and the linear filtering parameters, and filters the residual signal to obtain the audio signal; when the audio signal is a forward error correction frame signal, the feature parameters are extracted from the audio signal, the audio signal is converted into the filter speech excitation signal based on the linear filtering coefficients obtained by decoding the voice packet, speech enhancement processing is performed on the filter speech excitation signal according to the feature parameters and the long-term filtering parameters obtained by decoding the voice packet to obtain the enhanced speech excitation signal, and speech synthesis is performed based on the enhanced speech excitation signal and the linear filtering parameters to obtain the speech enhancement signal, so that the enhancement processing of the audio signal is completed in less time with a good signal enhancement effect, improving the timeliness of audio signal enhancement.
  • the speech packet processing module 1102 is further configured to: perform parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the parameter-configured long-term prediction filter , obtain the long-term filtering excitation signal; perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the linear prediction filter after the parameter configuration to obtain the audio signal.
  • the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtering excitation signal; and performs linear synthesis filtering on the long-term filtering excitation signal based on the linear filtering parameters obtained by decoding to obtain the audio signal, Therefore, when the audio signal is not a forward error correction frame signal, the audio signal can be directly output, and when the audio signal is a forward error correction frame signal, the audio signal is enhanced and output, which improves the timeliness of the audio signal output.
  • the voice packet processing module 1102 is further configured to: divide the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals; group the linear filtering parameters to obtain at least two linear filtering parameter sets; perform parameter configuration on at least two linear prediction filters respectively based on the linear filtering parameter sets; input the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub-audio signals corresponding to each subframe; and combine the sub-audio signals according to the timing of each subframe to obtain the audio signal.
  • in the above embodiment, the terminal divides the long-term filtering excitation signal into at least two subframes to obtain the sub-long-term filtering excitation signals, groups the linear filtering parameters to obtain at least two linear filtering parameter sets, performs parameter configuration on at least two linear prediction filters based on the parameter sets, and inputs the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets to obtain the sub-audio signals corresponding to each subframe; the sub-audio signals are combined according to the timing of each subframe to obtain the audio signal, which ensures that the obtained audio signal can better restore the audio signal sent by the sending end and improves the quality of the restored audio signal.
  • the linear filtering parameters include a linear filtering coefficient and an energy gain value; the voice packet processing module 1102 is further configured to: for the sub-long-term filtering excitation signal corresponding to the first subframe in the long-term filtering excitation signal, obtain the energy gain value corresponding to the historical sub-long-term filtering excitation signal of the subframe in the historical long-term filtering excitation signal that is adjacent to the sub-long-term filtering excitation signal corresponding to the first subframe; determine the energy adjustment parameter corresponding to the sub-long-term filtering excitation signal based on the energy gain value corresponding to the historical sub-long-term filtering excitation signal and the energy gain value of the sub-long-term filtering excitation signal corresponding to the first subframe; perform energy adjustment on the historical sub-long-term filtering excitation signal through the energy adjustment parameter; and input the obtained sub-long-term filtering excitation signal and the energy-adjusted historical sub-long-term filtering excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtering excitation signal, to obtain the sub-audio signal corresponding to the first subframe.
  • in this way, each subframe of the obtained audio signal can better restore the corresponding subframe of the audio signal sent by the sending end, improving the quality of the restored audio signal.
  • the apparatus further includes: a data abnormality determination module 1112 and a forward error correction frame signal determination module 1114, wherein: the data abnormality determination module 1112 is configured to determine whether a data abnormality occurred in the historical voice packet decoded before the voice packet is decoded; the forward error correction frame signal determination module 1114 is configured to determine, if a data abnormality occurred in the historical voice packet, that the audio signal obtained by decoding and filtering is a forward error correction frame signal.
  • in the above embodiment, the terminal determines whether a data abnormality occurred in the historical voice packet decoded before the current voice packet, thereby determining whether the current audio signal obtained by decoding and filtering is a forward error correction frame signal; audio signal enhancement processing can then be performed on the audio signal when it is a forward error correction frame signal, further improving the quality of the audio signal.
  • the feature parameters include cepstral feature parameters; the feature parameter extraction module 1104 is further configured to: perform Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain the logarithmic result; and perform inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters, as in the sketch below.
  • by extracting the cepstral feature parameters from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameters, thereby improving the quality of the audio signal.
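A compact sketch of the cepstral extraction (formula (6) in the description below). Taking the logarithm of the magnitude spectrum is one common reading of the "logarithmic processing" step and is an assumption here; `eps` guards against log(0).

```python
import numpy as np

def cepstral_features(s, eps=1e-9):
    spectrum = np.fft.fft(s)                  # Fourier-transformed audio signal S(F)
    log_spec = np.log(np.abs(spectrum) + eps) # logarithmic result
    return np.real(np.fft.ifft(log_spec))     # cepstral feature parameters C(n)
```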
  • the long-term filtering parameters include a pitch period and an amplitude gain value; the speech enhancement module 1108 is further configured to: filter the speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameter and the cepstral characteristic parameter. The voice enhancement processing is performed to obtain the enhanced voice excitation signal.
  • in the above embodiment, the terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters and the cepstral feature parameters to obtain the enhanced speech excitation signal, and can then enhance the audio signal based on the enhanced speech excitation signal, improving the quality of the audio signal.
  • the signal conversion module 1106 is further configured to: perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the linear prediction filter after the parameter configuration, to obtain the filter speech excitation Signal.
  • the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, thereby improving the quality of the audio signal.
  • the speech enhancement module 1108 is further configured to: input the feature parameters, the long-term filtering parameters, the linear filtering parameters and the filter speech excitation signal into the pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
  • in the above embodiment, the terminal obtains the enhanced speech excitation signal through the pretrained signal enhancement model, and can then enhance the audio signal based on the enhanced speech excitation signal, which improves the quality of the audio signal and the efficiency of the audio signal enhancement processing.
  • the feature parameters include cepstral feature parameters; the speech enhancement module 1108 is further configured to: perform vectorization processing on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters, and concatenate the results of the vectorization processing to obtain the feature vector; input the feature vector and the filter speech excitation signal into the pretrained signal enhancement model; perform feature extraction on the feature vector through the signal enhancement model to obtain the target feature vector; and perform enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
  • in the above embodiment, the terminal performs vectorization processing on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters, and concatenates the results of the vectorization processing to obtain the feature vector; the feature vector and the filter speech excitation signal are input into the pretrained signal enhancement model, which performs feature extraction on the feature vector to obtain the target feature vector and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, so that the enhancement processing of the audio signal can be realized through the signal enhancement model, improving the quality of the audio signal and the efficiency of the enhancement processing of the audio signal.
  • the speech enhancement module 1108 is further configured to: perform Fourier transform on the filter speech excitation signal to obtain the frequency-domain speech excitation signal; enhance the amplitude features of the frequency-domain speech excitation signal based on the target feature vector; and perform inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal.
  • in the above embodiment, the terminal obtains the frequency-domain speech excitation signal by performing Fourier transform on the filter speech excitation signal, enhances the amplitude features of the frequency-domain speech excitation signal based on the target feature vector, and performs inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal, so that the audio signal can be enhanced while the phase information of the audio signal is kept unchanged, improving the quality of the audio signal.
  • the speech synthesis module 1110 is further configured to: perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter to obtain the speech enhancement signal.
  • the terminal can obtain a speech enhancement signal by performing linear synthesis filtering on the enhanced speech excitation signal, that is, the enhancement processing of the audio signal is realized, and the quality of the audio signal is improved.
  • the linear filtering parameters include linear filtering coefficients and energy gain values; the speech synthesis module 1110 is further configured to: configure the linear prediction filter based on the linear filtering coefficients; obtain the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded; determine the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet; perform energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal; and input the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
  • in the above embodiment, the terminal configures the linear prediction filter based on the linear filtering coefficient, obtains the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded, determines the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet, performs energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal, and inputs the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal, thereby smoothing the audio signals of different frames and improving the quality of the speech composed of the audio signals of different frames.
  • Each module in the above-mentioned audio signal enhancement apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 13 .
  • the computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the computer device's database is used to store voice packet data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program when executed by a processor implements an audio signal enhancement method.
  • a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 14 .
  • the computer equipment includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
  • the computer program when executed by a processor implements an audio signal enhancement method.
  • the display screen of the computer equipment may be a liquid crystal display or an electronic ink display, and the input device of the computer equipment may be a touch layer covering the display screen, a button, trackball or touchpad provided on the housing of the computer equipment, or an external keyboard, trackpad or mouse.
  • FIG. 13 or FIG. 14 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • a computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the foregoing method embodiments when the processor executes the computer program.
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, implements the steps in the foregoing method embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio signal enhancement method and apparatus, a computer device, a computer-readable storage medium, and a program product. The method includes: sequentially decoding a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters, and filtering the residual signal to obtain an audio signal (S302); when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal (S304); converting the audio signal into a filter speech excitation signal based on the linear filtering parameters obtained by decoding the voice packet (S306); performing speech enhancement processing on the filter speech excitation signal according to the feature parameters and the long-term and linear filtering parameters obtained by decoding the voice packet, to obtain an enhanced speech excitation signal (S308); and performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal (S310).

Description

Audio signal enhancement method and apparatus, computer device, storage medium, and computer program product
This application claims priority to Chinese Patent Application No. 2021104841966, entitled "Audio signal enhancement method and apparatus, computer device, and storage medium", filed with the Chinese Patent Office on April 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to an audio signal enhancement method and apparatus, a computer device, a storage medium, and a computer program product.
Background
Quantization noise is usually introduced when audio signals are encoded and decoded, causing distortion in the decoded and synthesized speech. Conventional schemes typically use pitch filtering (Pitch Filter) or neural-network-based (Neural Network) post-processing techniques to enhance the audio signal, so as to reduce the impact of quantization noise on speech quality.
However, conventional schemes process the signal slowly, introduce considerable latency, and offer only a limited improvement in speech quality, resulting in poor timeliness of audio signal enhancement.
Summary
According to various embodiments of this application, an audio signal enhancement method and apparatus, a computer device, a storage medium, and a computer program product are provided.
An audio signal enhancement method, executed by a computer device, the method including:
sequentially decoding a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; filtering the residual signal to obtain an audio signal;
when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal;
converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; the performing parameter configuration on a linear prediction filter based on the linear filtering parameters and performing linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter includes:
performing parameter configuration on the linear prediction filter based on the linear filtering coefficient;
obtaining an energy gain value corresponding to a historical voice packet decoded before the voice packet is decoded;
determining an energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet;
performing energy adjustment on a historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter, to obtain an adjusted historical long-term filtering excitation signal;
inputting the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
An audio signal enhancement apparatus, the apparatus including:
a voice packet processing module, configured to sequentially decode a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters, and to filter the residual signal to obtain an audio signal;
a feature parameter extraction module, configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal;
a signal conversion module, configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters;
a speech enhancement module, configured to perform speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
a speech synthesis module, configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
A computer device, including a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:
sequentially decoding a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; filtering the residual signal to obtain an audio signal;
when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal;
converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the following steps:
sequentially decoding a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; filtering the residual signal to obtain an audio signal;
when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal;
converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
A computer program, the computer program including computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the following steps:
sequentially decoding a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; filtering the residual signal to obtain an audio signal;
when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal;
converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
Details of one or more embodiments of this application are set forth in the drawings and the description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of this application and constitute a part of this application; the illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:
FIG. 1 is a schematic diagram of an excitation-signal-based speech generation model in an embodiment;
FIG. 2 is a diagram of an application environment of an audio signal enhancement method in an embodiment;
FIG. 3 is a schematic flowchart of an audio signal enhancement method in an embodiment;
FIG. 4 is a schematic diagram of an audio signal transmission process in an embodiment;
FIG. 5 is an amplitude-frequency response diagram of a long-term prediction filter in an embodiment;
FIG. 6 is a schematic flowchart of a voice packet decoding and filtering step in an embodiment;
FIG. 7 is an amplitude-frequency response diagram of a long-term inverse filter in an embodiment;
FIG. 8 is a schematic diagram of a signal enhancement model in an embodiment;
FIG. 9 is a schematic flowchart of an audio signal enhancement method in another embodiment;
FIG. 10 is a schematic flowchart of an audio signal enhancement method in another embodiment;
FIG. 11 is a structural block diagram of an audio signal enhancement apparatus in an embodiment;
FIG. 12 is a structural block diagram of an audio signal enhancement apparatus in another embodiment;
FIG. 13 is an internal structure diagram of a computer device in an embodiment;
FIG. 14 is an internal structure diagram of a computer device in another embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application, not to limit it.
Before describing the audio signal enhancement method provided by this application, the speech generation model is first described. Referring to the excitation-signal-based speech generation model shown in FIG. 1, the physical theoretical basis of the model is the process by which human voice is produced, which includes:
(1) At the trachea, a noise-like impulse signal with a certain energy is produced; this impulse signal corresponds to the excitation signal in the excitation-signal-based speech generation model.
(2) The impulse signal impacts the human vocal cords, producing quasi-periodic opening and closing; after amplification through the oral cavity, sound is emitted. The emitted sound corresponds to the filter in the excitation-signal-based speech generation model.
In practice, considering the characteristics of sound, the filter in the excitation-signal-based speech generation model is subdivided into a long-term prediction (Long Term Prediction, LTP) filter and a linear prediction (Linear Predictive Coding, LPC) filter. The LTP filter exploits the long-term correlation of speech to strengthen the audio signal, and the LPC filter exploits the short-term correlation of speech to strengthen the audio signal. Specifically, for quasi-periodic signals such as voiced sounds, the excitation signal in the excitation-signal-based speech generation model impacts both the LTP filter and the LPC filter; for aperiodic signals such as unvoiced sounds, the excitation signal impacts only the LPC filter.
The solutions provided in the embodiments of this application relate to technologies such as machine learning of artificial intelligence, and are specifically described through the following embodiments. The audio signal enhancement method provided by this application is executed by a computer device and can specifically be applied to the application environment shown in FIG. 2. The terminal 202 communicates with the server 204 through a network; the terminal 202 may receive voice packets sent by the server 204, or voice packets from other devices forwarded by the server 204, and the server 204 may receive voice packets sent by the terminal or by other devices. The audio signal enhancement method can be applied to the terminal 202 or the server 204. Taking execution on the terminal 202 as an example: the terminal 202 sequentially decodes a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters, and filters the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, it extracts feature parameters from the audio signal; based on the linear filtering parameters, it converts the audio signal into a filter speech excitation signal; according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, it performs speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and based on the enhanced speech excitation signal and the linear filtering parameters it performs speech synthesis to obtain a speech enhancement signal.
The terminal 202 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server 204 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
In one embodiment, as shown in FIG. 3, an audio signal enhancement method is provided. Taking the method applied to the computer device (terminal or server) in FIG. 2 as an example, it includes the following steps:
S302: Sequentially decode a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; filter the residual signal to obtain an audio signal.
The received voice packet may be a voice packet in a packet-loss-resistant scenario based on forward error correction (Feedforward Error Correction, FEC) technology.
Forward error correction is an error control technique in which the signal is pre-encoded according to a certain algorithm before being fed into the transmission channel, adding redundant codes that carry characteristics of the signal itself; at the receiving end, the received signal is decoded according to the corresponding algorithm, so that error codes produced during transmission can be found and corrected.
Redundant codes may also be called redundant information. In the embodiments of this application, referring to FIG. 4, when encoding the audio signal of the current speech frame (the current frame for short), the signal sending end may encode the audio signal information of the previous speech frame (the previous frame for short) as redundant information into the voice packet corresponding to the current frame audio signal, and send that voice packet to the receiving end after encoding. In this way, even if a failure occurs during transmission so that the receiving end does not receive a certain voice packet, or a certain voice packet arrives with errors, the audio signal corresponding to the lost or erroneous voice packet can still be obtained by decoding the voice packet corresponding to the following speech frame (the following frame for short), improving the reliability of signal transmission. The receiving end may be the terminal 202 in FIG. 2.
Specifically, when receiving a voice packet, the terminal stores it in a buffer, then takes from the buffer the voice packet corresponding to the speech frame to be played, and decodes and filters that voice packet to obtain an audio signal. When the voice packet is adjacent to the historical voice packet decoded at the previous moment and no abnormality occurred in that historical voice packet, the obtained audio signal is output directly, or audio signal enhancement processing is performed on it to obtain a speech enhancement signal, which is then output. When the voice packet is not adjacent to the historical voice packet decoded at the previous moment, or it is adjacent but an abnormality occurred in that historical voice packet, audio signal enhancement processing is performed on the audio signal to obtain a speech enhancement signal, which is output; the speech enhancement signal carries the audio signal corresponding to the packet adjacent to the historical voice packet decoded at the previous moment.
The decoding may specifically be entropy decoding, the decoding scheme corresponding to entropy encoding. Specifically, when encoding the audio signal, the sending end may use an entropy encoding scheme to obtain the voice packet, so that on receiving the voice packet, the receiving end may use an entropy decoding scheme to decode it.
In one embodiment, when receiving a voice packet, the terminal decodes it to obtain a residual signal and filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain an audio signal. The filter parameters include long-term filtering parameters and linear filtering parameters.
Specifically, when encoding the current frame audio signal, the sending end obtains the filter parameters by analyzing the previous frame audio signal, configures the filter based on the obtained filter parameters, performs analysis filtering on the current frame audio signal through the configured filter to obtain the residual signal of the current frame audio signal, encodes the audio signal using the residual signal and the filter parameters obtained by analysis to obtain a voice packet, and sends the voice packet to the receiving end. After receiving the voice packet, the receiving end decodes it to obtain the residual signal and the filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain the audio signal.
In one embodiment, the filter parameters include linear filtering parameters and long-term filtering parameters. When encoding the current frame audio signal, the sending end obtains the linear filtering parameters and the long-term filtering parameters by analyzing the previous frame audio signal, performs linear analysis filtering on the current frame audio signal based on the linear filtering parameters to obtain a linear filtering excitation signal, performs long-term analysis filtering on the linear filtering excitation signal based on the long-term filtering parameters to obtain the residual signal corresponding to the current frame audio signal, encodes the current frame audio signal using the residual signal and the linear and long-term filtering parameters obtained by analysis to obtain a voice packet, and sends the voice packet to the receiving end.
Specifically, performing linear analysis filtering on the current frame audio signal based on the linear filtering parameters includes: performing parameter configuration on a linear prediction filter based on the linear filtering parameters, and performing linear analysis filtering on the audio signal through the parameter-configured linear prediction filter to obtain the linear filtering excitation signal. The linear filtering parameters include a linear filtering coefficient, which may be denoted LPC AR, and an energy gain value, which may be denoted LPC gain. The linear prediction filter is given by:
e(n) = s(n) - Σ_{i=1}^{p} a_i · s_adj(n-i)        (1)
where e(n) is the linear filtering excitation signal corresponding to the current frame audio signal, s(n) is the current frame audio signal, p is the number of sample points contained in each frame of audio signal, a_i are the linear filtering coefficients obtained by analyzing the previous frame audio signal, and s_adj(n-i) is the energy-adjusted state of the previous frame audio signal s(n-i) of the current frame audio signal s(n), which can be obtained by:
s_adj(n-i) = gain_adj · s(n-i)        (2)
where s(n-i) is the previous frame audio signal of the current frame audio signal s(n), and gain_adj is the energy adjustment parameter of the previous frame audio signal s(n-i), which can be obtained by:
gain_adj = gain(n) / gain(n-i)        (3)
where gain(n) is the energy gain value corresponding to the current frame audio signal, and gain(n-i) is the energy gain value corresponding to the previous frame audio signal.
Performing long-term analysis filtering on the linear filtering excitation signal based on the long-term filtering parameters specifically includes: performing parameter configuration on a long-term prediction filter based on the long-term filtering parameters, and performing long-term analysis filtering through the parameter-configured long-term prediction filter to obtain the residual signal corresponding to the current frame audio signal. The long-term filtering parameters include a pitch period, which may be denoted LTP pitch, and a corresponding amplitude gain value, which may be denoted LTP gain. The frequency-domain representation of the long-term prediction filter, where the frequency domain may be denoted the Z domain, is:
p(z) = 1 - γ·z^(-T)        (4)
where p(z) is the amplitude-frequency response of the long-term prediction filter, z is the rotation factor of the frequency-domain transform, γ is the amplitude gain value LTP gain, and T is the pitch period LTP pitch. FIG. 5 shows the amplitude-frequency response of the long-term prediction filter for γ = 1 and T = 80 in an embodiment.
The time-domain representation of the long-term prediction filter is:
δ(n) = e(n) - γ·e(n-T)        (5)
where δ(n) is the residual signal corresponding to the current frame audio signal, e(n) is the linear filtering excitation signal corresponding to the current frame audio signal, γ is the amplitude gain value LTP gain, T is the pitch period LTP pitch, and e(n-T) is the linear filtering excitation signal corresponding to the audio signal one pitch period before the current frame audio signal. A sketch of this encoder-side analysis filtering follows below.
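The encoder-side analysis filtering of formulas (1)–(5) can be sketched as follows. This is an illustration under simplifying assumptions (zero history before the frame for the LTP stage; the previous frame scaled as a whole by gain_adj), not the codec's exact implementation.

```python
import numpy as np

def lpc_analysis(s, s_prev, a, gain_adj):
    """Formula (1): e(n) = s(n) - sum_i a_i * s_adj(n-i); the previous
    frame enters through the filter memory, scaled per formulas (2)-(3)."""
    p = len(a)
    buf = np.concatenate([gain_adj * s_prev[-p:], s])  # s_adj history + frame
    e = np.empty_like(s)
    for n in range(len(s)):
        # p previous samples, most recent first, weighted by a_1..a_p
        e[n] = s[n] - np.dot(a, buf[n:n + p][::-1])
    return e

def ltp_analysis(e, gamma, T):
    """Formula (5): delta(n) = e(n) - gamma * e(n - T)."""
    delta = e.copy()
    delta[T:] -= gamma * e[:-T]   # samples before the frame start taken as 0
    return delta
```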
In one embodiment, the filter parameters obtained by the terminal through decoding include long-term filtering parameters and linear filtering parameters, and the signal synthesis filtering includes long-term synthesis filtering based on the long-term filtering parameters and linear synthesis filtering based on the linear filtering parameters. After decoding the voice packet to obtain the residual signal, the long-term filtering parameters, and the linear filtering parameters, the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain a long-term filtering excitation signal, and then performs linear synthesis filtering on the long-term filtering excitation signal based on the linear filtering parameters to obtain the audio signal.
In one embodiment, after obtaining the residual signal, the terminal divides it into multiple subframes to obtain multiple sub-residual signals; for each sub-residual signal, long-term synthesis filtering is performed based on the corresponding long-term filtering parameters to obtain the long-term filtering excitation signal corresponding to each subframe, and the long-term filtering excitation signals of the subframes are then combined according to the timing of the subframes to obtain the corresponding long-term filtering excitation signal.
For example, if one voice packet corresponds to 20 ms of audio signal, that is, the obtained residual signal is 20 ms, the residual signal can be divided into 4 subframes, yielding four 5 ms sub-residual signals; long-term synthesis filtering is performed on each 5 ms sub-residual signal based on the corresponding long-term filtering parameters, yielding four 5 ms long-term filtering excitation signals, which are combined according to the timing of the subframes into one 20 ms long-term filtering excitation signal.
In one embodiment, after obtaining the long-term filtering excitation signal, the terminal divides it into multiple subframes to obtain multiple sub-long-term filtering excitation signals; for each sub-long-term filtering excitation signal, linear synthesis filtering is performed based on the corresponding linear filtering parameters to obtain the sub-linear filtering excitation signal corresponding to each subframe, and the linear filtering excitation signals of the subframes are then combined according to the timing of the subframes to obtain the corresponding linear filtering excitation signal.
For example, if one voice packet corresponds to 20 ms of audio signal, that is, the obtained long-term filtering excitation signal is 20 ms, the long-term filtering excitation signal can be divided into 2 subframes, yielding two 10 ms sub-long-term filtering excitation signals; linear synthesis filtering is performed on each 10 ms sub-long-term filtering excitation signal based on the corresponding linear filtering parameters, yielding two 10 ms sub-audio signals, which are combined according to the timing of the subframes into one 20 ms audio signal.
S304: When the audio signal is a forward error correction frame signal, extract feature parameters from the audio signal.
The audio signal being a forward error correction frame signal means that the audio signal of a historical adjacent frame of this audio signal is abnormal. Such an abnormality specifically includes: the voice packet corresponding to the audio signal of the historical adjacent frame was not received, or the received voice packet corresponding to the audio signal of the historical adjacent frame could not be decoded normally. The feature parameters include cepstral feature parameters.
In one embodiment, after decoding and filtering the received voice packet to obtain the audio signal, the terminal determines whether a data abnormality occurred in the historical voice packet decoded before this voice packet; if so, the terminal determines that the audio signal currently obtained through decoding and filtering is a forward error correction frame signal.
Specifically, the terminal determines whether the historical audio signal corresponding to the historical voice packet decoded at the previous moment is the previous frame audio signal of the audio signal obtained by decoding this voice packet; if so, it determines that no data abnormality occurred in the historical voice packet; if not, it determines that a data abnormality occurred.
In this embodiment, by determining whether a data abnormality occurred in the historical voice packet decoded before the current voice packet, the terminal determines whether the audio signal currently obtained through decoding and filtering is a forward error correction frame signal, and can then perform audio signal enhancement processing on the audio signal when it is a forward error correction frame signal, further improving the quality of the audio signal.
In one embodiment, when the decoded audio signal is a forward error correction frame signal, feature parameters are extracted from the decoded audio signal. The extracted feature parameters may specifically be cepstral feature parameters, and the extraction includes the following steps: performing Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; performing logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; and performing inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters. The cepstral feature parameters can be extracted from the audio signal by the following formula:
C(n) = IFFT(log(S(F)))        (6)
where C(n) are the cepstral feature parameters of the audio signal S(n) obtained after decoding and filtering, and S(F) is the Fourier-transformed audio signal obtained by performing Fourier transform on the audio signal S(n).
In the above embodiment, by extracting the cepstral feature parameters from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameters, improving the quality of the audio signal.
In one embodiment, when the audio signal is not a forward error correction frame signal, that is, when no abnormality occurred in the previous frame audio signal of the currently decoded and filtered audio signal, feature parameters may also be extracted from the currently decoded and filtered audio signal so as to perform audio signal enhancement processing on it.
S306: Convert the audio signal into a filter speech excitation signal based on the linear filtering parameters.
Specifically, after decoding and filtering the voice packet to obtain the audio signal, the terminal may also obtain the linear filtering parameters obtained when decoding the voice packet, and perform linear analysis filtering on the obtained audio signal based on the linear filtering parameters, thereby converting the audio signal into the filter speech excitation signal.
In one embodiment, S306 specifically includes the following steps: performing parameter configuration on a linear prediction filter based on the linear filtering parameters, and performing linear decomposition filtering on the audio signal through the parameter-configured linear prediction filter to obtain the filter speech excitation signal.
Linear decomposition filtering is also called linear analysis filtering. In the embodiments of this application, when linear analysis filtering is performed on the audio signal, the whole frame of the audio signal is filtered directly, without subframe processing of the whole frame.
Specifically, the terminal may perform linear decomposition filtering on the audio signal using the following formula to obtain the filter speech excitation signal:
D(n) = S(n) - Σ_{i=1}^{p} A_i · S_adj(n-i)        (7)
where D(n) is the filter speech excitation signal corresponding to the audio signal S(n) obtained after decoding and filtering the voice packet, S(n) is the audio signal obtained after decoding and filtering the voice packet, S_adj(n-i) is the energy-adjusted state of the previous frame audio signal S(n-i) of the obtained audio signal S(n), p is the number of sample points contained in each frame of audio signal, and A_i are the linear filtering coefficients obtained by decoding the voice packet.
In the above embodiment, the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, improving the quality of the audio signal.
S308: Perform speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal.
The long-term filtering parameters include a pitch period and an amplitude gain value.
In one embodiment, S308 includes the following step: performing speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters, to obtain the enhanced speech excitation signal.
Specifically, the speech enhancement processing of the audio signal may be implemented through a pretrained signal enhancement model. The signal enhancement model is a neural network (Neural Network, NN) model, which may specifically adopt a structure cascading LSTM and CNN stages.
In the above embodiment, the terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters to obtain the enhanced speech excitation signal, and can then enhance the audio signal based on the enhanced speech excitation signal, improving the quality of the audio signal.
In one embodiment, the terminal inputs the obtained feature parameters, long-term filtering parameters, linear filtering parameters, and filter speech excitation signal into the pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal obtains the enhanced speech excitation signal through the pretrained signal enhancement model, and can then enhance the audio signal based on the enhanced speech excitation signal, improving the quality of the audio signal and the efficiency of the audio signal enhancement processing.
It should be noted that in the embodiments of this application, when speech enhancement processing is performed on the filter speech excitation signal through the pretrained signal enhancement model, the whole frame of the filter speech excitation signal is processed, without subframe processing of the whole frame.
S310: Perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
The speech synthesis may be linear synthesis filtering performed based on the linear filtering parameters.
In one embodiment, after obtaining the enhanced speech excitation signal, the terminal performs parameter configuration on a linear prediction filter based on the linear filtering parameters, and performs linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter to obtain the speech enhancement signal.
The linear filtering parameters include a linear filtering coefficient, which may be denoted LPC AR, and an energy gain value, which may be denoted LPC gain. Linear synthesis filtering is the inverse process of the linear analysis filtering performed when the sending end encodes the audio signal, so the linear prediction filter performing linear synthesis filtering is also called a linear inverse filter. The time-domain representation of the linear prediction filter is:
S_enh(n) = D_enh(n) + Σ_{i=1}^{p} A_i · S_adj(n-i)        (8)
where S_enh(n) is the speech enhancement signal, D_enh(n) is the enhanced speech excitation signal obtained by performing speech enhancement processing on the filter speech excitation signal D(n), S_adj(n-i) is the energy-adjusted state of the previous frame audio signal S(n-i) of the obtained audio signal S(n), p is the number of sample points contained in each frame of audio signal, and A_i are the linear filtering coefficients obtained by decoding the voice packet.
The energy-adjusted state S_adj(n-i) of the previous frame audio signal S(n-i) of the audio signal S(n) can be obtained by:
S_adj(n-i) = gain_adj · S(n-i)        (9)
where S_adj(n-i) is the energy-adjusted state of the previous frame audio signal S(n-i), and gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i).
In this embodiment, by performing linear synthesis filtering on the enhanced speech excitation signal, the terminal can obtain the speech enhancement signal, thereby realizing the enhancement processing of the audio signal and improving the quality of the audio signal.
It should be noted that in the embodiments of this application, the speech synthesis process performs speech synthesis on the whole frame of the enhanced speech excitation signal, without subframe processing of the whole frame.
In the above audio signal enhancement method, when the terminal receives a voice packet, it sequentially decodes and filters the voice packet to obtain an audio signal; when the audio signal is a forward error correction frame signal, it extracts feature parameters from the audio signal and converts the audio signal into a filter speech excitation signal based on the linear filtering coefficients obtained by decoding the voice packet, so as to perform speech enhancement processing on the filter speech excitation signal according to the feature parameters and the long-term filtering parameters obtained by decoding the voice packet, obtaining an enhanced speech excitation signal; speech synthesis is then performed based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal. The enhancement processing of the audio signal is thus completed in less time while achieving a good signal enhancement effect, improving the timeliness of audio signal enhancement.
In one embodiment, as shown in FIG. 6, S302 specifically includes the following steps:
S602: Perform parameter configuration on a long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the parameter-configured long-term prediction filter to obtain a long-term filtering excitation signal.
The long-term filtering parameters include a pitch period, which may be denoted LTP pitch, and a corresponding amplitude gain value, which may be denoted LTP gain; long-term synthesis filtering is performed on the residual signal through the parameter-configured long-term prediction filter. Long-term synthesis filtering is the inverse process of the long-term analysis filtering performed when the sending end encodes the audio signal, so the long-term prediction filter performing long-term synthesis filtering is also called a long-term inverse filter; that is, the long-term inverse filter is used to process the residual signal. The frequency-domain representation of the long-term inverse filter corresponding to formula (4) is:
p^(-1)(z) = 1 / (1 - γ·z^(-T))        (10)
where p^(-1)(z) is the amplitude-frequency response of the long-term inverse filter, z is the rotation factor of the frequency-domain transform, γ is the amplitude gain value LTP gain, and T is the pitch period LTP pitch. FIG. 7 shows the amplitude-frequency response of the long-term inverse prediction filter for γ = 1 and T = 80 in an embodiment.
The time-domain representation of the long-term inverse filter corresponding to formula (10) is:
E(n) = γ·E(n-T) + δ(n)        (11)
where E(n) is the long-term filtering excitation signal corresponding to the voice packet, δ(n) is the residual signal corresponding to the voice packet, γ is the amplitude gain value LTP gain, T is the pitch period LTP pitch, and E(n-T) is the long-term filtering excitation signal corresponding to the audio signal one pitch period earlier. It can be understood that in this embodiment, the long-term filtering excitation signal E(n) obtained by the receiving end through long-term synthesis filtering of the residual signal with the long-term inverse filter is the same as the linear filtering excitation signal e(n) obtained by the sending end through linear analysis filtering of the audio signal during encoding; a sketch follows below.
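A sketch of the long-term synthesis filter of formula (11); `e_hist` supplying at least the last T excitation samples of the preceding (sub)frame is an assumption about how the history is carried across frames.

```python
import numpy as np

def ltp_synthesis(delta, gamma, T, e_hist):
    """Formula (11): E(n) = gamma * E(n - T) + delta(n)."""
    buf = np.concatenate([e_hist[-T:], np.zeros(len(delta))])
    for n in range(len(delta)):
        buf[T + n] = gamma * buf[n] + delta[n]   # buf[n] is E(n - T)
    return buf[T:]
```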
S604: Perform parameter configuration on a linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the parameter-configured linear prediction filter to obtain the audio signal.
The linear filtering parameters include a linear filtering coefficient, which may be denoted LPC AR, and an energy gain value, which may be denoted LPC gain. Linear synthesis filtering is the inverse process of the linear analysis filtering performed when the sending end encodes the audio signal, so the linear prediction filter performing linear synthesis filtering is also called a linear inverse filter. The time-domain representation of the linear prediction filter is:
S(n) = E(n) + Σ_{i=1}^{p} A_i · S_adj(n-i)        (12)
where S(n) is the audio signal corresponding to the voice packet, E(n) is the long-term filtering excitation signal corresponding to the voice packet, S_adj(n-i) is the energy-adjusted state of the previous frame audio signal S(n-i) of the obtained audio signal S(n), p is the number of sample points contained in each frame of audio signal, and A_i are the linear filtering coefficients obtained by decoding the voice packet.
The energy-adjusted state S_adj(n-i) of the previous frame audio signal S(n-i) of the audio signal S(n) can be obtained by:
S_adj(n-i) = gain_adj · S(n-i), with gain_adj = gain(n) / gain(n-i)        (13)
where gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i), gain(n) is the energy gain value obtained by decoding the voice packet, and gain(n-i) is the energy gain value corresponding to the previous frame audio signal; a sketch of this synthesis step follows below.
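The linear synthesis filter of formulas (12)–(13) in the same style; the previous frame's samples are scaled by gain_adj before being used as filter memory, and within the current frame the newly synthesized samples are used unadjusted. The same routine applies to formula (8) if the enhanced excitation D_enh(n) is passed in.

```python
import numpy as np

def lpc_synthesis(excitation, a, s_prev, gain_adj):
    """Formula (12): S(n) = E(n) + sum_i A_i * S_adj(n-i)."""
    p = len(a)
    buf = np.concatenate([gain_adj * s_prev[-p:], np.zeros(len(excitation))])
    for n in range(len(excitation)):
        buf[p + n] = excitation[n] + np.dot(a, buf[n:n + p][::-1])
    return buf[p:]
```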
In the above embodiment, the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtering excitation signal, and performs linear synthesis filtering on the long-term filtering excitation signal based on the decoded linear filtering parameters to obtain the audio signal; in this way, the audio signal can be output directly when it is not a forward error correction frame signal, and enhanced before output when it is, improving the timeliness of audio signal output.
In one embodiment, S604 specifically includes the following steps: dividing the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals; grouping the decoded linear filtering parameters to obtain at least two linear filtering parameter sets; performing parameter configuration on at least two linear prediction filters respectively based on the linear filtering parameter sets; inputting the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to each subframe; and combining the sub-audio signals according to the timing of each subframe to obtain the audio signal.
The linear filtering parameter sets are of two types: linear filtering coefficient sets and energy gain value sets.
Specifically, for the sub-long-term filtering excitation signal corresponding to each subframe, when the linear inverse filter corresponding to formula (12) is used for linear synthesis filtering, S(n) in formula (12) is the sub-audio signal corresponding to any subframe, E(n) is the long-term filtering excitation signal corresponding to that subframe, S_adj(n-i) is the energy-adjusted state of the sub-audio signal S(n-i) of the subframe preceding the sub-audio signal S(n), p is the number of sample points contained in each subframe of audio signal, and A_i is the linear filtering coefficient set corresponding to that subframe; gain_adj in formula (13) is the energy adjustment parameter of the sub-audio signal of the preceding subframe, gain(n) is the energy gain value of this sub-audio signal, and gain(n-i) is the energy gain value of the sub-audio signal of the preceding subframe.
In the above embodiment, the terminal divides the long-term filtering excitation signal into at least two subframes to obtain the sub-long-term filtering excitation signals, groups the decoded linear filtering parameters to obtain at least two linear filtering parameter sets, performs parameter configuration on at least two linear prediction filters based on the parameter sets, and inputs the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets to obtain the sub-audio signals corresponding to each subframe; the sub-audio signals are combined according to the timing of each subframe to obtain the audio signal, which ensures that the obtained audio signal can better restore the audio signal sent by the sending end and improves the quality of the restored audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; S604 further includes the following steps: for the sub-long-term filtering excitation signal corresponding to the first subframe in the long-term filtering excitation signal, obtaining the energy gain value of the historical sub-long-term filtering excitation signal of the subframe in the historical long-term filtering excitation signal that is adjacent to the sub-long-term filtering excitation signal corresponding to the first subframe; determining the energy adjustment parameter corresponding to the sub-long-term filtering excitation signal based on the energy gain value corresponding to the historical sub-long-term filtering excitation signal and the energy gain value of the sub-long-term filtering excitation signal corresponding to the first subframe; and performing energy adjustment on the historical sub-long-term filtering excitation signal through the energy adjustment parameter to obtain the energy-adjusted historical sub-long-term filtering excitation signal.
The historical long-term filtering excitation signal is the long-term filtering excitation signal of the frame preceding the current frame's long-term filtering excitation signal; the historical sub-long-term filtering excitation signal of the subframe adjacent to the sub-long-term filtering excitation signal corresponding to the first subframe is the sub-long-term filtering excitation signal corresponding to the last subframe of the previous frame's long-term filtering excitation signal.
For example, if the long-term filtering excitation signal of the current frame is divided into two subframes, yielding the sub-long-term filtering excitation signals corresponding to the first and second subframes, then the sub-long-term filtering excitation signal corresponding to the second subframe of the previous frame's long-term filtering excitation signal and the sub-long-term filtering excitation signal corresponding to the first subframe of the current frame are adjacent subframes.
In one embodiment, after obtaining the energy-adjusted historical sub-long-term filtering excitation signal, the terminal inputs the obtained sub-long-term filtering excitation signal and the energy-adjusted historical sub-long-term filtering excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtering excitation signal, to obtain the sub-audio signal corresponding to the first subframe.
For example, one voice packet corresponds to 20 ms of audio signal, that is, the obtained long-term filtering excitation signal is 20 ms; the AR coefficients obtained by decoding the voice packet are {A_1, A_2, …, A_{p-1}, A_p, A_{p+1}, …, A_{2p-1}, A_{2p}}, and the energy gain values obtained by decoding the voice packet are {gain_1(n), gain_2(n)}. The long-term filtering excitation signal can be divided into two subframes, yielding the first sub-filtering excitation signal E_1(n) for the first 10 ms and the second sub-filtering excitation signal E_2(n) for the last 10 ms. The AR coefficients are grouped into AR coefficient set 1 {A_1, A_2, …, A_{p-1}, A_p} and AR coefficient set 2 {A_{p+1}, …, A_{2p-1}, A_{2p}}, and the energy gain values are grouped into energy gain value set 1 {gain_1(n)} and energy gain value set 2 {gain_2(n)}. The sub-filtering excitation signal of the subframe preceding the first sub-filtering excitation signal E_1(n) is E_2(n-i), with energy gain value set {gain_2(n-i)}; the sub-filtering excitation signal of the subframe preceding the second sub-filtering excitation signal E_2(n) is E_1(n), with energy gain value set {gain_1(n)}. The sub-audio signal corresponding to the first sub-filtering excitation signal E_1(n) can then be obtained by substituting the corresponding parameters into formulas (12) and (13), and likewise for the second sub-filtering excitation signal E_2(n).
In the above embodiment, for the sub-long-term filtering excitation signal corresponding to the first subframe in the long-term filtering excitation signal, the terminal obtains the energy gain value of the historical sub-long-term filtering excitation signal of the adjacent subframe in the historical long-term filtering excitation signal, determines the energy adjustment parameter based on the energy gain value corresponding to the historical sub-long-term filtering excitation signal and the energy gain value of the sub-long-term filtering excitation signal corresponding to the first subframe, performs energy adjustment on the historical sub-long-term filtering excitation signal through the energy adjustment parameter, and inputs the obtained sub-long-term filtering excitation signal and the energy-adjusted historical sub-long-term filtering excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtering excitation signal, obtaining the sub-audio signal corresponding to the first subframe; in this way, each subframe of the obtained audio signal can better restore the corresponding subframe sent by the sending end, improving the quality of the restored audio signal.
In one embodiment, the feature parameters include cepstral feature parameters, and S308 includes the following steps: performing vectorization processing on the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, and concatenating the results of the vectorization processing to obtain a feature vector; inputting the feature vector and the filter speech excitation signal into a pretrained signal enhancement model; performing feature extraction on the feature vector through the signal enhancement model to obtain a target feature vector; and performing enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
The signal enhancement model is a multi-stage network structure, specifically including a first feature concatenation layer, a second feature concatenation layer, a first neural network layer, and a second neural network layer. The target feature vector is the enhanced feature vector.
Specifically, the terminal performs vectorization processing on the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters through the first feature concatenation layer of the signal enhancement model, and concatenates the results of the vectorization processing to obtain the feature vector; the obtained feature vector is input to the first neural network layer of the signal enhancement model, which performs feature extraction on it to obtain a primary feature vector. The primary feature vector and the envelope information obtained by Fourier transform of the linear filtering coefficients in the linear filtering parameters are input to the second feature concatenation layer of the signal enhancement model to obtain the concatenated primary feature vector, which is input to the second neural network layer of the signal enhancement model; the second neural network layer performs feature extraction on the concatenated primary feature vector to obtain the target feature vector, and the filter speech excitation signal is then enhanced based on the target feature vector to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal performs vectorization processing on the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, concatenates the results to obtain the feature vector, inputs the feature vector and the filter speech excitation signal into the pretrained signal enhancement model, extracts features from the feature vector through the signal enhancement model to obtain the target feature vector, and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, so that the enhancement processing of the audio signal can be realized through the signal enhancement model, improving the quality of the audio signal and the efficiency of the enhancement processing of the audio signal.
In one embodiment, the terminal performs enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, including: performing Fourier transform on the filter speech excitation signal to obtain a frequency-domain speech excitation signal; enhancing the amplitude features of the frequency-domain speech excitation signal based on the target feature vector; and performing inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal.
Specifically, after performing Fourier transform on the filter speech excitation signal, the terminal obtains the frequency-domain speech excitation signal; after enhancing the amplitude features of the frequency-domain speech excitation signal based on the target feature vector, it performs inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features, combined with the phase features of the unenhanced frequency-domain speech excitation signal, to obtain the enhanced speech excitation signal.
As shown in FIG. 8, the two feature concatenation layers are concat1 and concat2, and the two neural network layers are NN part1 and NN part2. Through concat1, the cepstral feature parameters Cepstrum of dimension 40, the pitch period LTP pitch of dimension 1, and the amplitude gain value LTP Gain of dimension 1 are concatenated into a feature vector of dimension 42, which is input into NN part1. NN part1 consists of a two-layer convolutional neural network and a two-layer fully connected network; the dimension of the first convolution kernel is (1, 128, 3, 1) and that of the second is (128, 128, 3, 1), the fully connected layers have 128 and 8 nodes, and the activation function at the end of each layer is the Tanh function. NN part1 extracts high-level features from the feature vector, yielding a primary feature vector of dimension 1024; concat2 then concatenates the primary feature vector of dimension 1024 with the envelope information Envelope of dimension 161, obtained by Fourier transform of the linear filtering coefficients LPC AR in the linear filtering parameters, yielding a concatenated primary feature vector of dimension 1185, which is input into NN part2. NN part2 is a two-layer fully connected network with 256 and 161 nodes respectively, with the Tanh activation function at the end of each layer; the target feature vector is obtained through NN part2. Based on the target feature vector, the amplitude features Excitation of the frequency-domain speech excitation signal obtained by Fourier transform of the filter speech excitation signal are enhanced, and inverse Fourier transform is performed on the filter speech excitation signal with the enhanced amplitude features Excitation, yielding the enhanced speech excitation signal D_enh(n).
In the above embodiment, the terminal obtains the frequency-domain speech excitation signal by performing Fourier transform on the filter speech excitation signal, enhances the amplitude features of the frequency-domain speech excitation signal based on the target feature vector, and performs inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal, so that the audio signal can be enhanced while the phase information of the audio signal is kept unchanged, improving the quality of the audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; the step in which the terminal performs parameter configuration on the linear prediction filter based on the linear filtering parameters and performs linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter includes: performing parameter configuration on the linear prediction filter based on the linear filtering coefficient; obtaining the energy gain value corresponding to the historical voice packet decoded before this voice packet is decoded; determining an energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to this voice packet; performing energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter, to obtain the adjusted historical long-term filtering excitation signal; and inputting the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
The historical audio signal corresponding to the historical voice packet is the previous frame audio signal of the current frame audio signal corresponding to the current voice packet. The energy gain value corresponding to the historical voice packet may be the energy gain value corresponding to the whole-frame audio signal of the historical voice packet, or the energy gain value corresponding to partial subframe audio signals of the historical voice packet.
Specifically, when the audio signal is not a forward error correction frame signal, that is, when the previous frame audio signal of the current frame audio signal was obtained by the terminal normally decoding the historical voice packet, the energy gain value of the historical voice packet obtained when the terminal decoded the historical voice packet can be obtained, and the energy adjustment parameter is determined based on it; when the audio signal is a forward error correction frame, that is, when the previous frame audio signal of the current frame audio signal could not be obtained by the terminal normally decoding the historical voice packet, a compensated energy gain value corresponding to the previous frame audio signal is determined based on a preset energy gain compensation mechanism, and the compensated energy gain value is taken as the energy gain value of the historical voice packet, so as to determine the energy adjustment parameter based on the energy gain value of the historical voice packet.
In one embodiment, when the audio signal is not a forward error correction frame signal, the energy adjustment parameter gain_adj of the previous frame audio signal S(n-i) can be calculated by the following formula:
gain_adj = gain(n) / gain(n-i)        (14)
where gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i), gain(n-i) is the energy gain value of the previous frame audio signal S(n-i), and gain(n) is the energy gain value of the current frame audio signal. Formula (14) calculates the energy adjustment parameter based on the energy gain value corresponding to the whole-frame audio signal of the historical voice packet.
In one embodiment, when the audio signal is not a forward error correction frame signal, the energy adjustment parameter gain_adj of the previous frame audio signal S(n-i) can also be obtained by the following formula:
gain_adj = [(gain_1(n) + … + gain_m(n)) / m] / gain_m(n-i)        (15)
where gain_adj is the energy adjustment parameter of the previous frame audio signal S(n-i), gain_m(n-i) is the energy gain value of the m-th subframe of the previous frame audio signal S(n-i), gain_m(n) is the energy gain value of the m-th subframe of the current frame audio signal, m is the number of subframes corresponding to each audio signal, and (gain_1(n) + … + gain_m(n))/m is the energy gain value of the current frame audio signal. Formula (15) calculates the energy adjustment parameter based on the energy gain values corresponding to partial subframe audio signals of the historical voice packet.
In the above embodiment, the terminal configures the linear prediction filter based on the linear filtering coefficient, obtains the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded, determines the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet, performs energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal, and inputs the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal, thereby smoothing the audio signals of different frames and improving the quality of the speech composed of the audio signals of different frames.
In one embodiment, as shown in FIG. 9, an audio signal enhancement method is provided. Taking the method applied to the computer device (terminal or server) in FIG. 2 as an example, it includes the following steps:
S902: Decode the voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters.
S904: Perform parameter configuration on a long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the parameter-configured long-term prediction filter to obtain a long-term filtering excitation signal.
S906: Divide the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals.
S908: Group the decoded linear filtering parameters to obtain at least two linear filtering parameter sets.
S910: Perform parameter configuration on at least two linear prediction filters respectively based on the linear filtering parameter sets.
S912: Input the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub-audio signals corresponding to each subframe.
S914: Combine the sub-audio signals according to the timing of each subframe to obtain an audio signal.
S916: Determine whether a data abnormality occurred in the historical voice packet decoded before the voice packet is decoded.
S918: If a data abnormality occurred in the historical voice packet, determine that the audio signal obtained through decoding and filtering is a forward error correction frame signal.
S920: When the audio signal is a forward error correction frame signal, perform Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; perform inverse Fourier transform on the logarithmic result to obtain cepstral feature parameters.
S922: Perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the parameter-configured linear prediction filter to obtain a filter speech excitation signal.
S924: Input the feature parameters, the long-term filtering parameters, the linear filtering parameters, and the filter speech excitation signal into the pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain an enhanced speech excitation signal.
S926: Perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter to obtain a speech enhancement signal.
This application also provides an application scenario to which the above audio signal enhancement method is applied. Specifically, the application of the audio signal enhancement method in this scenario is as follows:
Taking a wideband signal with Fs of 16000 Hz as an example for description — it can be understood that this application is also applicable to other sampling rates, such as Fs of 8000 Hz, 32000 Hz, or 48000 Hz — the frame length of the audio signal is set to 20 ms; for Fs = 16000 Hz, this corresponds to 320 sample points per frame. Referring to FIG. 10, after receiving the voice packet corresponding to a frame of audio signal, the terminal performs entropy decoding on the voice packet to obtain δ(n), LTP pitch, LTP gain, LPC AR, and LPC gain; performs LTP synthesis filtering on δ(n) based on LTP pitch and LTP gain to obtain E(n); performs LPC synthesis filtering on each subframe of E(n) based on LPC AR and LPC gain and combines the LPC synthesis filtering results into one frame S(n); then performs cepstral analysis on S(n) to obtain C(n), and performs LPC decomposition filtering on the whole frame S(n) based on LPC AR and LPC gain to obtain the whole frame D(n); inputs LTP pitch, LTP gain, the envelope information obtained by Fourier transform of LPC AR, C(n), and D(n) into the pretrained signal enhancement model NN postfilter; enhances the whole frame D(n) through the NN postfilter to obtain the whole frame D_enh(n); and performs LPC synthesis filtering on the whole frame D_enh(n) based on LPC AR and LPC gain to obtain S_enh(n).
It should be understood that although the steps in the flowcharts of FIGS. 3, 4, 6, 9, and 10 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 3, 4, 6, 9, and 10 may include multiple sub-steps or stages, which are not necessarily executed and completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 11, an audio signal enhancement apparatus is provided. The apparatus may use software modules or hardware modules, or a combination of the two, to become part of a computer device. The apparatus specifically includes: a voice packet processing module 1102, a feature parameter extraction module 1104, a signal conversion module 1106, a speech enhancement module 1108, and a speech synthesis module 1110, wherein:
the voice packet processing module 1102 is configured to sequentially decode and filter a received voice packet to obtain a residual signal, long-term filtering parameters, and linear filtering parameters, and to filter the residual signal to obtain an audio signal;
the feature parameter extraction module 1104 is configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal;
the signal conversion module 1106 is configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters;
the speech enhancement module 1108 is configured to perform speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal;
the speech synthesis module 1110 is configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
In the above embodiment, the computer device sequentially decodes the received voice packet to obtain the residual signal, the long-term filtering parameters, and the linear filtering parameters, and filters the residual signal to obtain the audio signal; when the audio signal is a forward error correction frame signal, the feature parameters are extracted from the audio signal, the audio signal is converted into the filter speech excitation signal based on the linear filtering coefficients obtained by decoding the voice packet, speech enhancement processing is performed on the filter speech excitation signal according to the feature parameters and the long-term filtering parameters obtained by decoding the voice packet to obtain the enhanced speech excitation signal, and speech synthesis is performed based on the enhanced speech excitation signal and the linear filtering parameters to obtain the speech enhancement signal, so that the enhancement processing of the audio signal is completed in less time with a good signal enhancement effect, improving the timeliness of audio signal enhancement.
In one embodiment, the voice packet processing module 1102 is further configured to: perform parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the parameter-configured long-term prediction filter to obtain the long-term filtering excitation signal; perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the parameter-configured linear prediction filter to obtain the audio signal.
In the above embodiment, the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtering excitation signal, and performs linear synthesis filtering on the long-term filtering excitation signal based on the decoded linear filtering parameters to obtain the audio signal, so that the audio signal can be output directly when it is not a forward error correction frame signal, and enhanced before output when it is, improving the timeliness of audio signal output.
In one embodiment, the voice packet processing module 1102 is further configured to: divide the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals; group the linear filtering parameters to obtain at least two linear filtering parameter sets; perform parameter configuration on at least two linear prediction filters respectively based on the linear filtering parameter sets; input the obtained sub-long-term filtering excitation signals into the parameter-configured linear prediction filters respectively, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub-audio signals corresponding to each subframe; and combine the sub-audio signals according to the timing of each subframe to obtain the audio signal.
In the above embodiment, the terminal divides the long-term filtering excitation signal into at least two subframes, groups the linear filtering parameters into at least two parameter sets, configures at least two linear prediction filters accordingly, and inputs the sub-long-term filtering excitation signals into the parameter-configured linear prediction filters so that linear synthesis filtering is performed per subframe; the resulting sub-audio signals are combined according to the timing of each subframe to obtain the audio signal, which ensures that the obtained audio signal can better restore the audio signal sent by the sending end and improves the quality of the restored audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; the voice packet processing module 1102 is further configured to: for the sub-long-term filtering excitation signal corresponding to the first subframe in the long-term filtering excitation signal, obtain the energy gain value corresponding to the historical sub-long-term filtering excitation signal of the adjacent subframe in the historical long-term filtering excitation signal; determine the energy adjustment parameter corresponding to the sub-long-term filtering excitation signal based on the energy gain value corresponding to the historical sub-long-term filtering excitation signal and the energy gain value of the sub-long-term filtering excitation signal corresponding to the first subframe; perform energy adjustment on the historical sub-long-term filtering excitation signal through the energy adjustment parameter; and input the obtained sub-long-term filtering excitation signal and the energy-adjusted historical sub-long-term filtering excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtering excitation signal, to obtain the sub-audio signal corresponding to the first subframe.
In the above embodiment, this per-subframe energy adjustment ensures that each subframe of the obtained audio signal can better restore the corresponding subframe of the audio signal sent by the sending end, improving the quality of the restored audio signal.
In one embodiment, as shown in FIG. 12, the apparatus further includes a data abnormality determination module 1112 and a forward error correction frame signal determination module 1114, wherein: the data abnormality determination module 1112 is configured to determine whether a data abnormality occurred in the historical voice packet decoded before the voice packet is decoded; the forward error correction frame signal determination module 1114 is configured to determine, if a data abnormality occurred in the historical voice packet, that the audio signal obtained by decoding and filtering is a forward error correction frame signal.
In the above embodiment, the terminal determines whether a data abnormality occurred in the historical voice packet decoded before the current voice packet, thereby determining whether the current audio signal obtained by decoding and filtering is a forward error correction frame signal; audio signal enhancement processing can then be performed on the audio signal when it is a forward error correction frame signal, further improving the quality of the audio signal.
In one embodiment, the feature parameters include cepstral feature parameters; the feature parameter extraction module 1104 is further configured to: perform Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain the logarithmic result; and perform inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters.
In the above embodiment, by extracting the cepstral feature parameters from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameters, improving the quality of the audio signal.
In one embodiment, the long-term filtering parameters include a pitch period and an amplitude gain value; the speech enhancement module 1108 is further configured to: perform speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters, to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters to obtain the enhanced speech excitation signal, and can then enhance the audio signal based on the enhanced speech excitation signal, improving the quality of the audio signal.
In one embodiment, the signal conversion module 1106 is further configured to: perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the parameter-configured linear prediction filter to obtain the filter speech excitation signal.
In the above embodiment, the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, improving the quality of the audio signal.
In one embodiment, the speech enhancement module 1108 is further configured to: input the feature parameters, the long-term filtering parameters, the linear filtering parameters, and the filter speech excitation signal into the pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal obtains the enhanced speech excitation signal through the pretrained signal enhancement model, and can then enhance the audio signal based on the enhanced speech excitation signal, improving the quality of the audio signal and the efficiency of the audio signal enhancement processing.
In one embodiment, the feature parameters include cepstral feature parameters; the speech enhancement module 1108 is further configured to: perform vectorization processing on the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, and concatenate the results of the vectorization processing to obtain the feature vector; input the feature vector and the filter speech excitation signal into the pretrained signal enhancement model; perform feature extraction on the feature vector through the signal enhancement model to obtain the target feature vector; and perform enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal performs vectorization processing on the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, concatenates the results to obtain the feature vector, inputs the feature vector and the filter speech excitation signal into the pretrained signal enhancement model, extracts features from the feature vector through the signal enhancement model to obtain the target feature vector, and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal, so that the enhancement processing of the audio signal can be realized through the signal enhancement model, improving the quality of the audio signal and the efficiency of the enhancement processing of the audio signal.
In one embodiment, the speech enhancement module 1108 is further configured to: perform Fourier transform on the filter speech excitation signal to obtain the frequency-domain speech excitation signal; enhance the amplitude features of the frequency-domain speech excitation signal based on the target feature vector; and perform inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal.
In the above embodiment, the terminal obtains the frequency-domain speech excitation signal by performing Fourier transform on the filter speech excitation signal, enhances its amplitude features based on the target feature vector, and performs inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude features to obtain the enhanced speech excitation signal, so that the audio signal can be enhanced while the phase information of the audio signal is kept unchanged, improving the quality of the audio signal.
In one embodiment, the speech synthesis module 1110 is further configured to: perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter to obtain the speech enhancement signal.
In this embodiment, by performing linear synthesis filtering on the enhanced speech excitation signal, the terminal can obtain the speech enhancement signal, thereby realizing the enhancement processing of the audio signal and improving the quality of the audio signal.
In one embodiment, the linear filtering parameters include a linear filtering coefficient and an energy gain value; the speech synthesis module 1110 is further configured to: configure the linear prediction filter based on the linear filtering coefficient; obtain the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded; determine the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet; perform energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal; and input the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
In the above embodiment, the terminal configures the linear prediction filter based on the linear filtering coefficient, obtains the energy gain value corresponding to the historical voice packet decoded before the voice packet is decoded, determines the energy adjustment parameter based on the energy gain values corresponding to the historical voice packet and the voice packet, performs energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal, and inputs the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the parameter-configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal, thereby smoothing the audio signals of different frames and improving the quality of the speech composed of the audio signals of different frames.
For specific limitations on the audio signal enhancement apparatus, reference may be made to the limitations on the audio signal enhancement method above, which are not repeated here. Each module in the above audio signal enhancement apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure diagram may be as shown in FIG. 13. The computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the running of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store voice packet data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements an audio signal enhancement method.
In one embodiment, a computer device is provided; the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 14. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the running of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, an operator network, NFC (Near Field Communication), or other technologies. The computer program, when executed by the processor, implements an audio signal enhancement method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a button, trackball or touchpad provided on the housing of the computer device, or an external keyboard, trackpad or mouse.
Those skilled in the art can understand that the structures shown in FIG. 13 or FIG. 14 are only block diagrams of partial structures related to the solution of this application and do not constitute a limitation on the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机程序,该处理器执行计算机程序时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序,该计算机程序被处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory, ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
The foregoing embodiments merely express several implementations of this application, and their descriptions are relatively specific and detailed, but they shall not be construed as limiting the scope of the patent of this application. It should be noted that a person of ordinary skill in the art may further make several modifications and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the appended claims.

Claims (20)

  1. An audio signal enhancement method, performed by a computer device, the method comprising:
    decoding received speech packets in sequence, to obtain a residual signal, long-term filtering parameters, and linear filtering parameters; and filtering the residual signal, to obtain an audio signal;
    extracting feature parameters from the audio signal when the audio signal is a forward error correction frame signal;
    converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
    performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal; and
    performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters, to obtain a speech enhancement signal.
  2. The method according to claim 1, wherein the filtering the residual signal, to obtain an audio signal comprises:
    configuring a long-term prediction filter based on the long-term filtering parameters, and performing long-term synthesis filtering on the residual signal through the configured long-term prediction filter, to obtain a long-term filtering excitation signal; and
    configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the long-term filtering excitation signal through the configured linear prediction filter, to obtain the audio signal.
  3. The method according to claim 2, wherein the configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the long-term filtering excitation signal through the configured linear prediction filter, to obtain the audio signal comprises:
    dividing the long-term filtering excitation signal into at least two subframes, to obtain sub long-term filtering excitation signals;
    grouping the linear filtering parameters, to obtain at least two linear filtering parameter sets;
    configuring at least two linear prediction filters respectively based on the linear filtering parameter sets;
    inputting the obtained sub long-term filtering excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub audio signals corresponding to the respective subframes; and
    combining the sub audio signals according to the time order of the subframes, to obtain the audio signal.
  4. The method according to claim 3, wherein the linear filtering parameters comprise linear filtering coefficients and an energy gain value, and the method further comprises:
    for the sub long-term filtering excitation signal corresponding to a first subframe in the long-term filtering excitation signal, obtaining an energy gain value of a historical sub long-term filtering excitation signal of a subframe, in a historical long-term filtering excitation signal, adjacent to the sub long-term filtering excitation signal corresponding to the first subframe;
    determining, based on the energy gain value corresponding to the historical sub long-term filtering excitation signal and the energy gain value of the sub long-term filtering excitation signal corresponding to the first subframe, an energy adjustment parameter corresponding to the sub long-term filtering excitation signal; and
    performing energy adjustment on the historical sub long-term filtering excitation signal using the energy adjustment parameter;
    wherein the inputting the obtained sub long-term filtering excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub audio signals corresponding to the respective subframes comprises:
    inputting the obtained sub long-term filtering excitation signal and the energy-adjusted historical sub long-term filtering excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub long-term filtering excitation signal, to obtain a sub audio signal corresponding to the first subframe.
  5. The method according to claim 1, further comprising:
    determining whether a data anomaly occurs in a historical speech packet decoded before the speech packet; and
    when a data anomaly occurs in the historical speech packet, determining that the audio signal obtained through decoding and filtering is a forward error correction frame signal.
  6. The method according to claim 1, wherein the feature parameters comprise cepstral feature parameters, and the extracting feature parameters from the audio signal comprises:
    performing a Fourier transform on the audio signal, to obtain a Fourier-transformed audio signal;
    performing logarithm processing on the Fourier-transformed audio signal, to obtain a logarithm result; and
    performing an inverse Fourier transform on the logarithm result, to obtain the cepstral feature parameters.
  7. The method according to claim 6, wherein the long-term filtering parameters comprise a pitch period and an amplitude gain value; and
    the performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal comprises:
    performing speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters, to obtain the enhanced speech excitation signal.
  8. The method according to claim 1, wherein the converting the audio signal into a filter speech excitation signal based on the linear filtering parameters comprises:
    configuring a linear prediction filter based on the linear filtering parameters, and performing linear decomposition filtering on the audio signal through the configured linear prediction filter, to obtain the filter speech excitation signal.
  9. The method according to claim 1, wherein the performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal comprises:
    inputting the feature parameters, the long-term filtering parameters, the linear filtering parameters, and the filter speech excitation signal into a pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters, to obtain the enhanced speech excitation signal.
  10. The method according to claim 9, wherein the feature parameters comprise cepstral feature parameters; and the inputting the feature parameters, the long-term filtering parameters, the linear filtering parameters, and the filter speech excitation signal into a pretrained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters, to obtain the enhanced speech excitation signal comprises:
    vectorizing the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, and concatenating the vectorized results into a feature vector;
    inputting the feature vector and the filter speech excitation signal into the pretrained signal enhancement model;
    performing feature extraction on the feature vector through the signal enhancement model, to obtain a target feature vector; and
    performing enhancement processing on the filter speech excitation signal based on the target feature vector, to obtain the enhanced speech excitation signal.
  11. The method according to claim 10, wherein the performing enhancement processing on the filter speech excitation signal based on the target feature vector, to obtain the enhanced speech excitation signal comprises:
    performing a Fourier transform on the filter speech excitation signal, to obtain a frequency-domain speech excitation signal;
    enhancing amplitude features of the frequency-domain speech excitation signal based on the target feature vector; and
    performing an inverse Fourier transform on the amplitude-enhanced frequency-domain speech excitation signal, to obtain the enhanced speech excitation signal.
  12. The method according to claim 1, wherein the performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters, to obtain a speech enhancement signal comprises:
    configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter, to obtain the speech enhancement signal.
  13. The method according to claim 12, wherein the linear filtering parameters comprise linear filtering coefficients and an energy gain value; and the configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter comprises:
    configuring the linear prediction filter based on the linear filtering coefficients;
    obtaining an energy gain value corresponding to a historical speech packet decoded before the speech packet;
    determining an energy adjustment parameter based on the energy gain value corresponding to the historical speech packet and the energy gain value corresponding to the speech packet;
    performing energy adjustment on a historical long-term filtering excitation signal corresponding to the historical speech packet using the energy adjustment parameter, to obtain an adjusted historical long-term filtering excitation signal; and
    inputting the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
  14. An audio signal enhancement apparatus, comprising:
    a speech packet processing module, configured to decode received speech packets in sequence, to obtain a residual signal, long-term filtering parameters, and linear filtering parameters, and to filter the residual signal, to obtain an audio signal;
    a feature parameter extraction module, configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal;
    a signal conversion module, configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters;
    a speech enhancement module, configured to perform speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters, and the linear filtering parameters, to obtain an enhanced speech excitation signal; and
    a speech synthesis module, configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters, to obtain a speech enhancement signal.
  15. The apparatus according to claim 14, wherein the speech packet processing module is further configured to:
    configure a long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the configured long-term prediction filter, to obtain a long-term filtering excitation signal; and
    configure a linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the configured linear prediction filter, to obtain the audio signal.
  16. The apparatus according to claim 15, wherein the speech packet processing module is further configured to:
    divide the long-term filtering excitation signal into at least two subframes, to obtain sub long-term filtering excitation signals;
    group the linear filtering parameters, to obtain at least two linear filtering parameter sets;
    configure at least two linear prediction filters respectively based on the linear filtering parameter sets;
    input the obtained sub long-term filtering excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub long-term filtering excitation signals based on the linear filtering parameter sets, to obtain sub audio signals corresponding to the respective subframes; and
    combine the sub audio signals according to the time order of the subframes, to obtain the audio signal.
  17. The apparatus according to claim 16, wherein the linear filtering parameters comprise linear filtering coefficients and an energy gain value, and the speech packet processing module is further configured to:
    for the sub long-term filtering excitation signal corresponding to a first subframe in the long-term filtering excitation signal, obtain an energy gain value of a historical sub long-term filtering excitation signal of a subframe, in a historical long-term filtering excitation signal, adjacent to the sub long-term filtering excitation signal corresponding to the first subframe;
    determine, based on the energy gain value corresponding to the historical sub long-term filtering excitation signal and the energy gain value of the sub long-term filtering excitation signal corresponding to the first subframe, an energy adjustment parameter corresponding to the sub long-term filtering excitation signal;
    perform energy adjustment on the historical sub long-term filtering excitation signal using the energy adjustment parameter; and
    input the obtained sub long-term filtering excitation signal and the energy-adjusted historical sub long-term filtering excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub long-term filtering excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub long-term filtering excitation signal, to obtain a sub audio signal corresponding to the first subframe.
  18. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 13.
  19. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
  20. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
PCT/CN2022/086960 2021-04-30 2022-04-15 Audio signal enhancement method and apparatus, computer device, storage medium and computer program product WO2022228144A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023535590A JP2023553629A (ja) 2021-04-30 2022-04-15 オーディオ信号強化方法、装置、コンピュータ機器及びコンピュータプログラム
EP22794615.9A EP4297025A1 (en) 2021-04-30 2022-04-15 Audio signal enhancement method and apparatus, computer device, storage medium, and computer program product
US18/076,116 US20230099343A1 (en) 2021-04-30 2022-12-06 Audio signal enhancement method and apparatus, computer device, storage medium and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110484196.6 2021-04-30
CN202110484196.6A CN113763973A (zh) 2021-04-30 2021-12-07 Audio signal enhancement method and apparatus, computer device and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/076,116 Continuation US20230099343A1 (en) 2021-04-30 2022-12-06 Audio signal enhancement method and apparatus, computer device, storage medium and computer program product

Publications (1)

Publication Number Publication Date
WO2022228144A1 (zh)

Family

ID=78786944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086960 WO2022228144A1 (zh) 2022-04-15 Audio signal enhancement method and apparatus, computer device, storage medium and computer program product

Country Status (5)

Country Link
US (1) US20230099343A1 (zh)
EP (1) EP4297025A1 (zh)
JP (1) JP2023553629A (zh)
CN (1) CN113763973A (zh)
WO (1) WO2022228144A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763973A (zh) * 2021-04-30 2021-12-07 Tencent Technology (Shenzhen) Co., Ltd. Audio signal enhancement method and apparatus, computer device and storage medium
CN113938749B (zh) * 2021-11-30 2023-05-05 Beijing Baidu Netcom Science and Technology Co., Ltd. Audio data processing method and apparatus, electronic device and storage medium
CN116994587B (zh) * 2023-09-26 2023-12-08 Chengdu Aeronautic Polytechnic Training supervision system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714820A (zh) * 2013-12-27 2014-04-09 Guangzhou Huaduo Network Technology Co., Ltd. Packet loss concealment method and apparatus in the parameter domain
CN105765651A (zh) * 2013-10-31 2016-07-13 Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. Audio decoder and method for providing decoded audio information using an error concealment based on a time-domain excitation signal
CN107248411A (zh) * 2016-03-29 2017-10-13 Huawei Technologies Co., Ltd. Frame loss concealment processing method and apparatus
CN111554308A (zh) * 2020-05-15 2020-08-18 Tencent Technology (Shenzhen) Co., Ltd. Speech processing method, apparatus, device and storage medium
CN112489665A (zh) * 2020-11-11 2021-03-12 Beijing Rongxun Kechuang Technology Co., Ltd. Speech processing method and apparatus, and electronic device
WO2021050155A1 (en) * 2019-09-09 2021-03-18 Qualcomm Incorporated Artificial intelligence based audio coding
CN113763973A (zh) * 2021-04-30 2021-12-07 Tencent Technology (Shenzhen) Co., Ltd. Audio signal enhancement method and apparatus, computer device and storage medium

Also Published As

Publication number Publication date
JP2023553629A (ja) 2023-12-25
CN113763973A (zh) 2021-12-07
US20230099343A1 (en) 2023-03-30
EP4297025A1 (en) 2023-12-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22794615; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2023535590; Country of ref document: JP)
WWE Wipo information: entry into national phase (Ref document number: 2022794615; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022794615; Country of ref document: EP; Effective date: 20230920)
NENP Non-entry into the national phase (Ref country code: DE)