WO2022228144A1 - Audio signal enhancement method and apparatus, computer device, storage medium and computer program product - Google Patents
Audio signal enhancement method and apparatus, computer device, storage medium and computer program product
- Publication number: WO2022228144A1 (PCT/CN2022/086960)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: filtering; excitation signal; long; signal; linear
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/09—Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
- G10L19/26—Pre-filtering or post-filtering
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0232—Noise filtering with processing in the frequency domain
- G10L21/0364—Speech enhancement by changing the amplitude for improving intelligibility
- G10L25/24—Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/90—Pitch determination of speech signals
- G10L2019/0001—Codebooks
- G10L2019/0011—Long term prediction filters, i.e. pitch estimation
Definitions
- the present application relates to the field of computer technology, and in particular, to an audio signal enhancement method, apparatus, computer equipment, storage medium and computer program product.
- In audio encoding and decoding, quantization noise is usually introduced, which distorts the decoded and synthesized speech.
- In the conventional art, a pitch filter or neural-network-based post-processing technology is usually used to enhance the audio signal, so as to reduce the influence of quantization noise on speech quality.
- Embodiments of the present application provide an audio signal enhancement method, apparatus, computer device, storage medium, and computer program product for improving the quality of audio signals.
- An audio signal enhancement method performed by a computer device, the method comprising:
- sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filtering the residual signal to obtain an audio signal;
- when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
- converting the audio signal into a filter speech excitation signal based on the linear filtering parameters;
- performing speech enhancement processing on the filter speech excitation signal according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and
- performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- In some embodiments, the linear filtering parameters include a linear filtering coefficient and an energy gain value, and performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters includes: performing parameter configuration on a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the parameter-configured linear prediction filter.
- performing energy adjustment on the historical long-term filtering excitation signal corresponding to the historical voice packet through an energy adjustment parameter, so as to obtain an adjusted historical long-term filtering excitation signal;
- An audio signal enhancement apparatus, the apparatus comprising:
- a voice packet processing module, configured to sequentially decode received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and to filter the residual signal to obtain an audio signal;
- a feature parameter extraction module, configured to extract characteristic parameters from the audio signal when the audio signal is a forward error correction (FEC) frame signal;
- a signal conversion module, configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters;
- a speech enhancement module, configured to perform speech enhancement processing on the filter speech excitation signal according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal;
- a speech synthesis module, configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- A computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
- when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
- performing speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and
- performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- A computer-readable storage medium stores a computer program which, when executed by a processor, implements the following steps:
- when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
- performing speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and
- performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- A computer program product comprises computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the following steps:
- when the audio signal is a forward error correction frame signal, extracting characteristic parameters from the audio signal;
- performing speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and
- performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- FIG. 1 is a schematic diagram of a speech generation model based on an excitation signal in one embodiment;
- FIG. 2 is a diagram of an application environment of the audio signal enhancement method in one embodiment;
- FIG. 3 is a schematic flowchart of an audio signal enhancement method in one embodiment;
- FIG. 4 is a schematic diagram of an audio signal transmission process flow in one embodiment;
- FIG. 5 is an amplitude-frequency response diagram of a long-term prediction filter in one embodiment;
- FIG. 6 is a schematic flowchart of a voice packet decoding and filtering step in one embodiment;
- FIG. 7 is an amplitude-frequency response diagram of a long-term inverse filter in one embodiment;
- FIG. 8 is a schematic diagram of a signal enhancement model in one embodiment;
- FIG. 9 is a schematic flowchart of an audio signal enhancement method in another embodiment;
- FIG. 10 is a schematic flowchart of an audio signal enhancement method in yet another embodiment;
- FIG. 11 is a structural block diagram of an audio signal enhancement apparatus in one embodiment;
- FIG. 12 is a structural block diagram of an audio signal enhancement apparatus in another embodiment;
- FIG. 13 is an internal structure diagram of a computer device in one embodiment;
- FIG. 14 is an internal structure diagram of a computer device in another embodiment.
- In the speech generation model based on the excitation signal, the excitation signal impacts the human vocal cords, producing a quasi-periodic opening and closing; after being amplified through the oral cavity, a sound is emitted, and this sounding process corresponds to the filter in the speech generation model based on the excitation signal.
- the filters in the speech generation model based on the excitation signal are subdivided into Long Term Prediction (LTP) filters and Linear Predictive Coding (LPC) filters.
- LTP Long Term Prediction
- LPC Linear Predictive Coding
- the LTP filter uses the long-term correlation of speech to enhance the audio signal
- the LPC filter uses the short-term correlation of speech to enhance the audio signal.
- For quasi-periodic signals such as voiced sounds, the excitation signal impacts both the LTP filter and the LPC filter; for aperiodic signals such as unvoiced sounds, the excitation signal only impacts the LPC filter.
- The solutions provided by the embodiments of the present application relate to technologies such as artificial intelligence and machine learning, and are specifically described by the following embodiments:
- The audio signal enhancement method provided by the present application is executed by a computer device, and can be specifically applied in the application environment shown in FIG. 2.
- The terminal 202 communicates with the server 204 through a network; the terminal 202 can receive voice packets sent by the server 204, or voice packets forwarded by other devices through the server 204, and the server 204 can receive voice packets sent by the terminal or by other devices.
- The above audio signal enhancement method can be applied to the terminal 202 or the server 204; the following description takes the terminal 202 as an example.
- The terminal 202 sequentially decodes the received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and filters the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, it extracts characteristic parameters from the audio signal; based on the linear filtering parameters, it converts the audio signal into a filter speech excitation signal; according to the characteristic parameters, the long-term filtering parameters and the linear filtering parameters, it performs speech enhancement processing on the filter speech excitation signal to obtain an enhanced speech excitation signal; and it performs speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- The terminal 202 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 204 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
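The end-to-end flow performed by the terminal in the paragraph above can be sketched as the following skeleton. All helper names (`decode`, `synthesis_filter`, `is_fec_frame`, the `enhancer` methods) are hypothetical stand-ins for the steps described, not names from the patent:

```python
def enhance_received_packet(packet, decoder, enhancer):
    """Illustrative skeleton of the decode-then-enhance flow described above."""
    # Step 1: decode the packet into a residual signal and filter parameters.
    residual, ltp_params, lpc_params = decoder.decode(packet)

    # Step 2: long-term + linear synthesis filtering yields the audio signal.
    audio = decoder.synthesis_filter(residual, ltp_params, lpc_params)

    # Steps 3-6 run only when the frame is a forward error correction frame.
    if not decoder.is_fec_frame(packet):
        return audio

    features = enhancer.extract_cepstrum(audio)               # Step 3
    excitation = enhancer.lpc_analysis(audio, lpc_params)     # Step 4
    enhanced = enhancer.enhance(excitation, features,
                                ltp_params, lpc_params)       # Step 5
    return enhancer.lpc_synthesis(enhanced, lpc_params)       # Step 6
```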
- In one embodiment, an audio signal enhancement method is provided; the method is described by taking its application to the computer device (terminal or server) in FIG. 2 as an example, and includes the following steps:
- S302: Decode the received voice packets in sequence to obtain a residual signal, long-term filtering parameters and linear filtering parameters; filter the residual signal to obtain an audio signal.
- The received voice packets may be voice packets in an anti-packet-loss scenario based on forward error correction (Forward Error Correction, FEC) technology.
- FEC forward error correction
- Forward error correction technology is an error control method in which, before a signal is sent into the transmission channel, it is encoded in advance according to a certain algorithm and redundancy carrying characteristics of the signal itself is added; at the receiving end, the received signal is decoded according to the corresponding algorithm, so that error codes generated during transmission can be found and corrected.
- Redundancy may also be referred to as redundant information.
- In one embodiment, when encoding the audio signal of the current voice frame (referred to as the current frame), the signal transmitting end encodes the audio signal information of the previous frame into the voice packet corresponding to the current frame as redundant information; after the encoding is completed, the voice packet corresponding to the current frame is sent to the receiving end.
- When the receiving end detects that the voice packet of the next voice frame (referred to as the next frame) is lost or erroneous, it decodes the redundant information in the corresponding voice packets, thereby obtaining the audio signal corresponding to the lost or erroneous voice packet and improving the reliability of signal transmission.
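The in-band redundancy scheme described above can be illustrated with a minimal sketch, assuming each packet simply carries a copy of the previous frame as its redundant payload (packet layout and function names are hypothetical):

```python
def pack_with_fec(frames):
    """Attach the previous frame to each packet as redundant information."""
    packets = []
    prev = None
    for i, frame in enumerate(frames):
        packets.append({"seq": i, "primary": frame, "redundant": prev})
        prev = frame
    return packets


def recover(packets, lost_seq):
    """Recover a lost frame from the redundant copy in the next packet."""
    for pkt in packets:
        if pkt["seq"] == lost_seq + 1 and pkt["redundant"] is not None:
            return pkt["redundant"]
    return None  # unrecoverable: no later packet carries a copy of this frame
```

If the packet with sequence number `n` is lost, the receiver decodes the redundant payload of packet `n + 1` instead of discarding the frame.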
- the receiving end may be the terminal 202 in FIG. 2 .
- When the terminal receives a voice packet, it stores the received voice packet in the cache, then takes out from the cache the voice packet corresponding to the voice frame to be played, and decodes and filters that voice packet to obtain the audio signal.
- In one embodiment, when the voice packet is an adjacent packet of the historical voice packet decoded at the previous moment, and the historical voice packet decoded at the previous moment has no abnormality, the obtained audio signal is output directly, or audio signal enhancement processing is performed on the audio signal to obtain a speech enhancement signal and the speech enhancement signal is output; when the voice packet is not an adjacent packet of the historical voice packet decoded at the previous moment, or the historical voice packet decoded at the previous moment is abnormal, audio signal enhancement processing is performed on the audio signal to obtain a speech enhancement signal and the speech enhancement signal is output, wherein the speech enhancement signal carries the audio signal corresponding to the adjacent packet of the historical voice packet decoded at the previous moment.
- The decoding may be entropy decoding, which is the decoding scheme corresponding to entropy encoding. Specifically, when the transmitting end encodes the audio signal, it may use an entropy encoding scheme to obtain the voice packet, so that the receiving end, upon receiving the voice packet, can use an entropy decoding scheme to decode it.
- Specifically, when receiving a voice packet, the terminal decodes the received voice packet to obtain a residual signal and filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain an audio signal.
- the filter parameters include long-term filter parameters and linear filter parameters.
- When encoding the audio signal of the current frame, the transmitting end obtains the filter parameters by analyzing the audio signal of the previous frame and configures the filter based on the obtained filter parameters; it then performs analysis filtering on the audio signal of the current frame through the configured filter to obtain the residual signal of the current frame, encodes the audio signal using the residual signal and the filter parameters obtained by the analysis to obtain a voice packet, and sends the voice packet to the receiving end. After receiving the voice packet, the receiving end decodes it to obtain the residual signal and filter parameters, and performs signal synthesis filtering on the residual signal based on the filter parameters to obtain the audio signal.
- the filter parameters include linear filtering parameters and long-term filtering parameters.
- The transmitting end obtains the linear filtering parameters and the long-term filtering parameters by analyzing the audio signal of the previous frame; it then performs linear analysis filtering on the audio signal of the current frame based on the linear filtering parameters to obtain a linear filtering excitation signal, and performs long-term analysis filtering on the linear filtering excitation signal based on the long-term filtering parameters to obtain the residual signal corresponding to the audio signal of the current frame. The residual signal, together with the linear filtering parameters and long-term filtering parameters obtained by the analysis, is used to encode the audio signal of the current frame into a voice packet, which is sent to the receiving end.
- Performing linear analysis filtering on the audio signal of the current frame based on the linear filtering parameters specifically includes: performing parameter configuration on the linear prediction filter based on the linear filtering parameters, and performing linear analysis filtering on the audio signal through the parameter-configured linear prediction filter to obtain the linear filtering excitation signal, wherein the linear filtering parameters include linear filtering coefficients and an energy gain value.
- the linear filter coefficients can be recorded as LPC AR, and the energy gain value can be recorded as LPC gain.
- the formula of the linear prediction filter is as follows: e(n) = s(n) − Σ_{i=1}^{p} a_i · s_adj(n−i)
- e(n) is the linear filtering excitation signal corresponding to the audio signal of the current frame
- s(n) is the audio signal of the current frame
- p is the number of sampling points contained in each frame of the audio signal
- a_i is the i-th linear filtering coefficient obtained by analyzing the audio signal of the previous frame
- s_adj(n−i) is the energy-adjusted version of s(n−i), the previous-frame sample relative to the current frame audio signal s(n)
- s_adj(n−i) can be obtained by the following formula: s_adj(n−i) = gain_adj · s(n−i)
- s(n−i) is the previous-frame audio signal sample of the current frame audio signal s(n)
- gain_adj is the energy adjustment parameter of the previous-frame audio signal s(n−i)
- gain_adj can be obtained by the following formula: gain_adj = gain(n) / gain(n−i)
- gain(n) is the energy gain value corresponding to the audio signal of the current frame
- gain(n-i) is the energy gain value corresponding to the audio signal of the previous frame.
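The analysis-filtering relation defined above, with the previous frame's samples scaled by the energy adjustment factor, can be sketched in pure Python. The function name, list-based signal representation, and the assumption that the previous frame holds at least `order` samples are illustrative, not from the patent:

```python
def lpc_analysis_filter(frame, prev_frame, lpc_ar, gain_cur, gain_prev):
    """Linear analysis (decomposition) filtering with energy adjustment.

    Computes e(n) = s(n) - sum_i a_i * s_adj(n - i), where only the tail of
    the previous frame is scaled by gain_adj = gain_cur / gain_prev; samples
    inside the current frame are used as-is.  Assumes len(prev_frame) is at
    least the prediction order.
    """
    order = len(lpc_ar)
    gain_adj = gain_cur / gain_prev
    # Energy-adjusted history taken from the end of the previous frame.
    history = [x * gain_adj for x in prev_frame[-order:]]
    excitation = []
    for n, s in enumerate(frame):
        past = history + frame[:n]          # all samples preceding s(n)
        pred = sum(lpc_ar[i] * past[-(i + 1)] for i in range(order))
        excitation.append(s - pred)
    return excitation
```

With a one-tap predictor `a_1 = 0.5` and equal gains, a frame whose每 sample is half its predecessor yields a zero excitation, as expected for a perfectly predicted signal.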
- Performing long-term analysis filtering on the linear filtering excitation signal based on the long-term filtering parameters specifically includes: performing parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and performing long-term analysis filtering on the linear filtering excitation signal through the parameter-configured long-term prediction filter to obtain the residual signal corresponding to the audio signal of the current frame, wherein the long-term filtering parameters include the pitch period and the corresponding amplitude gain value; the pitch period can be recorded as LTP pitch, and the corresponding amplitude gain value can be recorded as LTP gain.
- the frequency domain representation of the long-term prediction filter is as follows (the frequency domain can be denoted as the Z domain): p(z) = 1 − β · z^(−T)
- p(z) is the amplitude-frequency response of the long-term prediction filter
- z is the variable of the Z-transform (frequency domain transformation)
- β is the amplitude gain value LTP gain
- T is the pitch period LTP pitch
- the time domain representation of the long-term prediction filter is as follows: δ(n) = e(n) − β · e(n−T)
- δ(n) is the residual signal corresponding to the audio signal of the current frame
- e(n) is the linear filtering excitation signal corresponding to the audio signal of the current frame
- β is the amplitude gain value LTP gain
- T is the pitch period LTP pitch
- e(n−T) is the linear filtering excitation signal one pitch period T before the current sample of the audio signal of the current frame.
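The time-domain relation δ(n) = e(n) − β·e(n−T) described above can be sketched directly; the `history` buffer, which supplies e(n−T) for samples near the start of the frame, is an illustrative assumption about how the filter state would be kept:

```python
def ltp_analysis_filter(excitation, history, ltp_gain, ltp_pitch):
    """Long-term analysis filtering: delta(n) = e(n) - beta * e(n - T).

    `history` holds earlier excitation samples so that e(n - T) is defined
    for every n in the current frame (len(history) >= ltp_pitch assumed).
    """
    buf = list(history) + list(excitation)
    offset = len(history)
    return [buf[offset + n] - ltp_gain * buf[offset + n - ltp_pitch]
            for n in range(len(excitation))]
```

For a perfectly periodic excitation with period T and β = 1, the residual is zero, which is exactly the periodic redundancy the LTP filter removes.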
- the filter parameters decoded by the terminal include long-term filtering parameters and linear filtering parameters
- the signal synthesis filtering includes long-term synthesis filtering based on the long-term filtering parameters, and linear synthesis filtering based on the linear filtering parameters.
- In one embodiment, after obtaining the residual signal, the terminal divides it into multiple subframes to obtain multiple sub-residual signals, performs long-term synthesis filtering on each sub-residual signal based on the corresponding long-term filtering parameters to obtain the long-term filtering excitation signal corresponding to each subframe, and then combines the long-term filtering excitation signals of the subframes according to their timing to obtain the long-term filtering excitation signal corresponding to the residual signal.
- For example, a 20 ms residual signal can be divided into 4 subframes to obtain four 5 ms sub-residual signals; each sub-residual signal is subjected to long-term synthesis filtering based on the corresponding long-term filtering parameters to obtain four 5 ms long-term filtering excitation signals, which are then combined according to the timing of the subframes to obtain a 20 ms long-term filtering excitation signal.
- In one embodiment, the terminal divides the obtained long-term filtering excitation signal into multiple subframes to obtain multiple sub-long-term filtering excitation signals; for each sub-long-term filtering excitation signal, it performs linear synthesis filtering based on the corresponding linear filtering parameters to obtain the sub-audio signal corresponding to each subframe, and then combines the sub-audio signals according to the timing of the subframes to obtain the corresponding audio signal.
- For example, a 20 ms long-term filtering excitation signal can be divided into two subframes to obtain two 10 ms sub-long-term filtering excitation signals; each 10 ms sub-long-term filtering excitation signal is subjected to linear synthesis filtering based on the corresponding linear filtering parameters to obtain two 10 ms sub-audio signals, which are then combined according to the timing of the subframes to obtain a 20 ms audio signal.
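The split-filter-recombine pattern shared by the two embodiments above can be sketched generically. `filter_fn` is a stand-in for the per-subframe synthesis filter (long-term or linear), keyed by the subframe index so each subframe can use its own parameters:

```python
def filter_by_subframes(signal, n_sub, filter_fn):
    """Split a frame into n_sub equal subframes, filter each one, and
    reassemble the results in subframe order.

    E.g. a 20 ms residual at 16 kHz (320 samples) splits into four 5 ms
    subframes of 80 samples for long-term synthesis filtering.
    """
    size = len(signal) // n_sub
    out = []
    for k in range(n_sub):
        sub = signal[k * size:(k + 1) * size]
        out.extend(filter_fn(k, sub))   # per-subframe parameters keyed by k
    return out
```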
- The audio signal being a forward error correction frame signal means that the audio signal of the historical adjacent frame of the audio signal is abnormal; the abnormality of the audio signal of the historical adjacent frame specifically includes: the audio signal of the historical adjacent frame not being received, or the received audio signal of the historical adjacent frame being erroneous.
- the characteristic parameters include cepstral characteristic parameters.
- In one embodiment, the terminal determines whether the historical voice packet decoded before the current voice packet has a data abnormality; if the decoded historical voice packet has a data abnormality, it determines that the currently decoded and filtered audio signal is a forward error correction frame signal.
- In one embodiment, the terminal determines whether the historical audio signal corresponding to the historical voice packet decoded at the previous moment is the previous-frame audio signal of the audio signal obtained by decoding the current voice packet; if so, it determines that the historical voice packet has no data abnormality; if not, it determines that the historical voice packet has a data abnormality.
- In this embodiment, by determining whether the historical voice packet decoded before the current voice packet has a data abnormality, the terminal determines whether the currently decoded and filtered audio signal is a forward error correction frame signal, so that audio signal enhancement processing can be performed on the audio signal when it is a forward error correction frame signal, further improving the quality of the audio signal.
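The adjacency check above can be illustrated with a sequence-number sketch. Representing frames by sequence numbers is an assumption for illustration; the patent phrases the check in terms of whether the previously decoded audio signal is the previous frame of the current one:

```python
def is_fec_frame(current_seq, last_decoded_seq):
    """Treat the current frame as an FEC frame when the previously decoded
    packet is not the immediate predecessor of the current one, i.e. the
    prior frame was lost or arrived abnormally."""
    return last_decoded_seq is None or current_seq != last_decoded_seq + 1
```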
- In one embodiment, when the audio signal is a forward error correction frame signal, a characteristic parameter is extracted from the decoded audio signal; the extracted characteristic parameter may specifically be a cepstral characteristic parameter, and the extraction specifically includes the following steps: performing a Fourier transform on the audio signal to obtain the Fourier-transformed audio signal; performing logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; and performing an inverse Fourier transform on the obtained logarithmic result to obtain the cepstral characteristic parameters.
- the cepstral feature parameters can be extracted from the audio signal by the following formula:
- C(n) is the cepstral characteristic parameter of the audio signal S(n) obtained after decoding and filtering
- S(F) is the Fourier-transformed audio signal obtained by performing a Fourier transform on the audio signal S(n).
- by extracting the cepstral feature parameter from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameter, thereby improving the quality of the audio signal.
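As a rough illustration (not code from the patent), the three extraction steps above, Fourier transform, logarithm, and inverse Fourier transform, can be sketched in NumPy; the frame length and sampling rate here are arbitrary assumptions:

```python
import numpy as np

def cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one audio frame: IFFT(log|FFT(frame)|)."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon guards against log(0)
    return np.fft.ifft(log_mag).real

# a 20 ms frame at an assumed 16 kHz sampling rate (320 samples)
frame = np.sin(2 * np.pi * 100 * np.arange(320) / 16000)
c = cepstrum(frame)
```

In practice a codec would use a fixed-dimension cepstral vector (the network example later in this document assumes 40 dimensions), typically by truncating or binning this result.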
- feature parameters can also be extracted from the currently decoded and filtered audio signal in order to perform audio signal enhancement processing on it.
- the terminal can also obtain the linear filtering parameters obtained when decoding the voice packet and perform linear analysis filtering on the audio signal based on those parameters, thereby converting the audio signal into a filter speech excitation signal.
- S306 specifically includes the following steps: perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the configured linear prediction filter to obtain the filter speech excitation signal.
- the linear decomposition filtering is also called linear analysis filtering.
- when linear analysis filtering is performed on the audio signal, it is applied directly to the entire frame of the audio signal, and there is no need to divide the whole frame into subframes.
- the terminal can use the following formula to linearly decompose and filter the audio signal to obtain the filter speech excitation signal:
- D(n) is the filter speech excitation signal corresponding to the audio signal S(n) obtained after decoding and filtering the speech packet
- S(n) is the audio signal obtained after decoding and filtering the speech packet
- S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the obtained audio signal S(n)
- p is the number of sampling points included in each frame of audio signal
- a_i is the linear filter coefficient obtained by decoding the voice packet.
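A minimal NumPy sketch of whole-frame linear analysis filtering of this form, D(n) = S(n) - sum_i a_i * S_adj(n-i), where samples before the frame start come from the energy-adjusted previous frame; the tiny frame, single coefficient, and history values below are made up for illustration:

```python
import numpy as np

def linear_analysis_filter(frame, lpc_coeffs, prev_frame_adj):
    """Whole-frame LPC analysis filtering:
    D(n) = S(n) - sum_i a_i * S_hist(n-i),
    where history before the frame start is the energy-adjusted
    previous frame (S_adj)."""
    p = len(lpc_coeffs)
    # prepend the last p energy-adjusted samples of the previous frame
    extended = np.concatenate([prev_frame_adj[-p:], frame])
    excitation = np.empty_like(frame)
    for n in range(len(frame)):
        hist = extended[n:n + p][::-1]  # S(n-1) ... S(n-p)
        excitation[n] = frame[n] - np.dot(lpc_coeffs, hist)
    return excitation

# toy example: p = 1, a_1 = 0.5, previous-frame tail [4.0]
D = linear_analysis_filter(np.array([1.0, 2.0, 3.0]),
                           np.array([0.5]),
                           np.array([4.0]))
```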
- the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, thereby improving the quality of the audio signal.
- the long-term filtering parameters include pitch period and amplitude gain value.
- S308 includes the following steps: performing speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filter parameter and the cepstral characteristic parameter to obtain an enhanced speech excitation signal.
- the speech enhancement processing of the audio signal can be implemented by a pre-trained signal enhancement model, and the signal enhancement model is a neural network (Neural Network, NN) model, and the neural network model can specifically adopt the structure of LSTM and CNN.
- the terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters, and the cepstral feature parameters to obtain the enhanced speech excitation signal; enhancement of the audio signal can then be realized based on the enhanced speech excitation signal, improving the quality of the audio signal.
- the terminal inputs the obtained feature parameters, long-term filtering parameters, linear filtering parameters, and the filter speech excitation signal into a pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on these parameters to obtain the enhanced speech excitation signal.
- the terminal obtains the enhanced speech excitation signal through the pre-trained signal enhancement model and can then enhance the audio signal based on it, which improves both the quality of the audio signal and the efficiency of the audio signal enhancement processing.
- the speech enhancement processing is performed on the entire frame of the filter speech excitation signal, and there is no need to divide the whole frame of the filter speech excitation signal into subframes.
- S310 Perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- the speech synthesis may be linear synthesis filtering based on linear filtering parameters.
- after obtaining the enhanced speech excitation signal, the terminal configures the linear prediction filter with the linear filtering parameters and performs linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter to obtain the speech enhancement signal.
- the linear filtering parameters include linear filtering coefficients and energy gain values
- the linear filtering coefficients can be recorded as LPC AR
- the energy gain value can be recorded as LPC gain
- the linear synthesis filtering is the inverse process of the linear analysis filtering performed by the transmitting end when encoding the audio signal; therefore, the linear prediction filter that performs linear synthesis filtering is also called a linear inverse filter, and its time-domain representation is as follows:
- S_enh(n) is the speech enhancement signal
- D_enh(n) is the enhanced speech excitation signal obtained by performing speech enhancement processing on the filter speech excitation signal D(n)
- S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the obtained audio signal S(n); p is the number of sampling points contained in each frame of the audio signal
- a i is the linear filter coefficient obtained by decoding the voice packet.
- S_adj(n-i), the energy-adjusted state of the audio signal S(n-i) of the frame preceding the audio signal S(n), can be obtained by the following formula:
- S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i)
- gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
- the terminal can obtain a speech enhancement signal by performing linear synthesis filtering on the enhanced speech excitation signal, that is, the enhancement processing of the audio signal is realized, and the quality of the audio signal is improved.
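The linear synthesis filtering described above, S(n) = D(n) + sum_i a_i * S_adj(n-i), can be sketched as the inverse of the analysis filter; the values below are toy numbers (they invert the tiny analysis example used earlier in this rewrite), not values from the patent:

```python
import numpy as np

def linear_synthesis_filter(excitation, lpc_coeffs, prev_frame, gain_adj):
    """Whole-frame LPC synthesis (inverse of analysis filtering):
    S(n) = D(n) + sum_i a_i * S_hist(n-i).
    History before the frame start is the previous frame scaled by
    the energy adjustment parameter gain_adj (giving S_adj)."""
    p = len(lpc_coeffs)
    history = list(gain_adj * prev_frame[-p:])  # energy-adjusted tail
    out = []
    for d in excitation:
        hist = history[-p:][::-1]  # S(n-1) ... S(n-p)
        s = d + float(np.dot(lpc_coeffs, hist))
        out.append(s)
        history.append(s)  # within-frame history uses synthesized samples
    return np.array(out)

# toy example: p = 1, a_1 = 0.5, previous-frame tail [4.0], gain_adj = 1
S = linear_synthesis_filter(np.array([-1.0, 1.5, 2.0]),
                            np.array([0.5]),
                            np.array([4.0]),
                            1.0)
```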
- speech synthesis is performed on the entire frame of the enhanced speech excitation signal, and there is no need to divide the whole frame of the enhanced speech excitation signal into subframes.
- when a terminal receives a voice packet, it sequentially decodes and filters the voice packet to obtain an audio signal; when the audio signal is a forward error correction frame signal, the terminal extracts feature parameters from the audio signal, converts the audio signal into a filter speech excitation signal based on the linear filter coefficients obtained by decoding the voice packet, performs speech enhancement processing on the filter speech excitation signal according to the feature parameters and the long-term filtering parameters obtained by decoding the voice packet to obtain the enhanced speech excitation signal, and then performs speech synthesis to obtain a speech enhancement signal; the enhancement processing of the audio signal can thus be completed in less time while achieving a better signal enhancement effect, which improves the timeliness of audio signal enhancement.
- S302 specifically includes the following steps:
- S602 Perform parameter configuration on the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the long-term prediction filter configured by the parameters to obtain a long-term filtering excitation signal.
- the long-term filtering parameters include the pitch period and the corresponding amplitude gain value.
- the pitch period can be recorded as LTP pitch.
- the corresponding amplitude gain value can be recorded as LTP gain.
- the long-term prediction filter performs long-term synthesis filtering on the residual signal, where the long-term synthesis filtering is the inverse process of the long-term analysis filtering performed when the transmitting end encodes the audio signal; the long-term prediction filter that performs long-term synthesis filtering is therefore also called a long-term inverse filter, that is, a long-term inverse filter is used to process the residual signal.
- the frequency domain representation of the long-term inverse filter corresponding to formula (1) is as follows:
- P^(-1)(z) is the amplitude-frequency response of the long-term inverse filter
- z is the rotation factor of the frequency domain transformation
- γ is the amplitude gain value LTP gain
- T is the pitch period LTP pitch.
- E(n) is the long-term filter excitation signal corresponding to the voice packet
- ε(n) is the residual signal corresponding to the voice packet
- γ is the amplitude gain value LTP gain
- T is the pitch period LTP pitch
- E(n-T) is the long-term filtering excitation signal corresponding to the audio signal one pitch period before the voice packet.
- the long-term filtering excitation signal E(n), obtained at the receiving end by performing long-term synthesis filtering on the residual signal through the long-term inverse filter, is the same as the linear filtering excitation signal e(n) obtained at the transmitting end by performing linear analysis filtering on the audio signal during encoding.
- S604 Perform parameter configuration on the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtering excitation signal through the configured linear prediction filter to obtain an audio signal.
- the linear filtering parameters include linear filtering coefficients and energy gain values
- the linear filtering coefficients can be recorded as LPC AR
- the energy gain value can be recorded as LPC gain
- the linear synthesis filtering is the inverse process of the linear analysis filtering performed by the transmitting end when encoding the audio signal; therefore, the linear prediction filter that performs linear synthesis filtering is also called a linear inverse filter, and its time-domain representation is as follows:
- S(n) is the audio signal corresponding to the voice packet
- E(n) is the long-term filtering excitation signal corresponding to the voice packet
- S_adj(n-i) is the energy-adjusted state of the previous-frame audio signal S(n-i) of the obtained audio signal S(n)
- p is the number of sampling points included in each frame of audio signal
- a_i is the linear filter coefficient obtained by decoding the voice packet.
- S_adj(n-i), the energy-adjusted state of the audio signal S(n-i) of the frame preceding the audio signal S(n), can be obtained by the following formula:
- gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
- gain(n) is the energy gain value obtained by decoding the voice packet
- gain(n-i) is the energy gain value corresponding to the previous-frame audio signal.
- the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtering excitation signal, and performs linear synthesis filtering on the long-term filtering excitation signal based on the decoded linear filtering parameters to obtain the audio signal; therefore, when the audio signal is not a forward error correction frame signal, it can be output directly, and when it is a forward error correction frame signal, it is enhanced and then output, which improves the timeliness of the audio signal output.
- S604 specifically includes the following steps: divide the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals; group the decoded linear filtering parameters to obtain at least two linear filtering parameter sets; configure at least two linear prediction filters based on the linear filtering parameter sets; input the sub-long-term filtering excitation signals into the configured linear prediction filters, so that each linear prediction filter performs linear synthesis filtering on its sub-long-term filtering excitation signal based on its linear filtering parameter set, obtaining the sub-audio signal corresponding to each subframe; and combine the sub-audio signals according to the timing of the subframes to obtain the audio signal.
- the linear filtering parameter sets are of two types: linear filter coefficient sets and energy gain value sets.
- S(n) in formula (12) is the sub-audio signal corresponding to any subframe; E(n) is the long-term filtering excitation signal corresponding to that subframe
- S_adj(n-i) is the energy-adjusted state of the sub-audio signal S(n-i)
- p is the number of sampling points included in each subframe audio signal
- a_i is a linear filter coefficient from the linear filter coefficient set corresponding to the subframe
- gain_adj in formula (13) is the energy adjustment parameter of the sub-audio signal of the previous subframe; gain(n) is the energy gain value of the sub-audio signal; gain(n-i) is the energy gain value of the sub-audio signal of the previous subframe.
- the terminal divides the long-term filtering excitation signal into at least two subframes to obtain sub-long-term filtering excitation signals, groups the decoded linear filtering parameters to obtain at least two linear filtering parameter sets, configures at least two linear prediction filters based on those sets, and inputs the sub-long-term filtering excitation signals into the configured linear prediction filters so that each performs linear synthesis filtering based on its linear filtering parameter set; this restores the audio signal sent by the transmitting end and improves the quality of the restored audio signal.
- the linear filtering parameters include a linear filter coefficient and an energy gain value; S604 further includes the following steps: for the sub-long-term filtering excitation signal corresponding to the first subframe of the long-term filtering excitation signal, obtain the energy gain value of the historical sub-long-term filtering excitation signal in the historical long-term filtering excitation signal whose subframe is adjacent to that first subframe; determine the energy adjustment parameter for the sub-long-term filtering excitation signal based on the energy gain value of the historical sub-long-term filtering excitation signal and the energy gain value of the sub-long-term filtering excitation signal corresponding to the first subframe; and adjust the energy of the historical sub-long-term filtering excitation signal with the energy adjustment parameter to obtain the energy-adjusted historical sub-long-term filtering excitation signal.
- the historical long-term filtering excitation signal is the long-term filtering excitation signal of the frame preceding the current frame, and the historical sub-long-term filtering excitation signal adjacent to the sub-long-term filtering excitation signal of the first subframe is the sub-long-term filtering excitation signal corresponding to the last subframe of the previous frame's long-term filtering excitation signal.
- the long-term filtering excitation signal of the current frame is divided into two subframes, giving the sub-long-term filtering excitation signals corresponding to the first subframe and the second subframe; the sub-long-term filtering excitation signal corresponding to the second subframe of the previous frame's long-term filtering excitation signal and the sub-long-term filtering excitation signal corresponding to the first subframe of the current frame are adjacent subframes.
- after obtaining the energy-adjusted historical sub-long-term filtering excitation signal, the terminal inputs the sub-long-term filtering excitation signal and the energy-adjusted historical sub-long-term filtering excitation signal into the configured linear prediction filter, so that the filter performs linear synthesis filtering on the sub-long-term filtering excitation signal of the first subframe based on the linear filter coefficients and the energy-adjusted historical sub-long-term filtering excitation signal, obtaining the sub-audio signal corresponding to the first subframe.
- a voice packet corresponds to 20 ms of audio, that is, the long-term filtering excitation signal obtained is 20 ms long
- the AR coefficients obtained by decoding the voice packet are {A_1, A_2, ..., A_(p-1), A_p, A_(p+1), ..., A_(2p-1), A_2p}
- the energy gain values obtained by decoding the voice packet are {gain_1(n), gain_2(n)}
- the long-term filtering excitation signal can be divided into two subframes to obtain the first sub-filtering excitation signal E_1(n) corresponding to the first 10 ms and the second sub-filtering excitation signal E_2(n) corresponding to the last 10 ms; the AR coefficients are grouped to obtain AR coefficient set 1 {A_1, A_2, ..., A_(p-1), A_p} and AR coefficient set 2 {A_(p+1), ..., A_(2p-1), A_2p}, and the energy gain values are grouped to obtain energy gain value set 1 {gain_1(n)} and energy gain value set 2 {gain_2(n)}; the energy gain value set of the subframe preceding the first sub-filtering excitation signal E_1(n) is {gain_2(n-i)}
- the sub-filtering excitation signal of the subframe preceding the second sub-filtering excitation signal E_2(n) is E_1(n)
- the energy gain value set of the subframe preceding the second sub-filtering excitation signal E_2(n) is {gain_1(n)}
- the sub-audio signal corresponding to the first sub-filtering excitation signal E_1(n) can be obtained by substituting the corresponding parameters into formula (12) and formula (13), and the sub-audio signal corresponding to the second sub-filtering excitation signal E_2(n) can likewise be obtained by substituting the corresponding parameters into formula (12) and formula (13).
- for the sub-long-term filtering excitation signal corresponding to the first subframe of the long-term filtering excitation signal, the terminal obtains the adjacent sub-long-term filtering excitation signal from the historical long-term filtering excitation signal.
- the characteristic parameter includes a cepstral characteristic parameter
- S308 includes the following steps: vectorize the cepstral feature parameters, the long-term filtering parameters, and the linear filtering parameters, and splice the vectorized results to obtain a feature vector; input the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; perform feature extraction on the feature vector through the signal enhancement model to obtain a target feature vector; and enhance the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
- the signal enhancement model is a multi-level network structure, which specifically includes a first feature splicing layer, a second feature splicing layer, a first neural network layer and a second neural network layer.
- the target feature vector is the enhanced feature vector.
- the terminal vectorizes the cepstral feature parameters, long-term filtering parameters, and linear filtering parameters through the first feature splicing layer of the signal enhancement model and splices the results to obtain a feature vector; the feature vector is input to the first neural network layer of the signal enhancement model, which extracts features from it to obtain a primary feature vector; through the second feature splicing layer, the primary feature vector is spliced with the envelope information obtained by Fourier transforming the linear filter coefficients in the linear filtering parameters; the spliced primary feature vector is input into the second neural network layer of the signal enhancement model, which extracts features from it to obtain the target feature vector; the filter speech excitation signal is then enhanced based on the target feature vector to obtain the enhanced speech excitation signal.
- the terminal obtains a feature vector by vectorizing the cepstral feature parameters, long-term filtering parameters, and linear filtering parameters and splicing the vectorized results; performs feature extraction on the feature vector through the signal enhancement model to obtain the target feature vector; and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal; in this way, the signal enhancement model can be used to perform enhancement processing on the audio signal, improving both the quality of the audio signal and the efficiency of the enhancement processing.
- the terminal enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal as follows: perform a Fourier transform on the filter speech excitation signal to obtain a frequency-domain speech excitation signal; enhance the amplitude feature of the frequency-domain speech excitation signal based on the target feature vector; and perform an inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude feature to obtain the enhanced speech excitation signal.
- after performing a Fourier transform on the filter speech excitation signal, the terminal obtains the frequency-domain speech excitation signal; after enhancing its amplitude feature based on the target feature vector, the terminal combines the unenhanced phase feature of the frequency-domain speech excitation signal and performs an inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude feature to obtain the enhanced speech excitation signal.
- the two feature splicing layers are concat1 and concat2 respectively, and the two neural network layers are NN part1 and NN part2 respectively.
- the 40-dimensional cepstral feature parameter Cepstrum, the 1-dimensional pitch period LTP pitch, and the 1-dimensional amplitude gain value LTP gain are spliced together by concat1 to form a 42-dimensional feature vector, and the 42-dimensional feature vector is input into NN part1, which consists of a two-layer convolutional neural network and a two-layer fully connected network.
- the dimension of the convolution kernel of the first layer is (1, 128, 3, 1)
- the dimension of the convolution kernel of the second layer is (128, 128, 3, 1)
- the numbers of nodes of the two fully connected layers are 128 and 8
- the activation function at the end of each layer is the Tanh function
- NN part1 extracts high-level features from the feature vector to obtain a 1024-dimensional primary feature vector; concat2 then splices the 1024-dimensional primary feature vector with the 161-dimensional envelope information Envelope obtained by Fourier transforming the linear filter coefficients LPC AR, yielding a spliced primary feature vector of dimension 1185; the spliced primary feature vector is input into NN part2, a two-layer fully connected network with 256 and 161 nodes respectively.
- the activation function at the end of each layer is the Tanh function.
- the target feature vector is obtained through NN part2; then, based on the target feature vector, the amplitude feature Excitation of the frequency-domain speech excitation signal obtained by Fourier transforming the filter speech excitation signal is enhanced, and an inverse Fourier transform is performed on the signal with the enhanced amplitude feature Excitation to obtain the enhanced speech excitation signal D_enh(n).
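The feature dimensions quoted above (40 + 1 + 1 = 42 into concat1, and 1024 + 161 = 1185 into concat2) can be checked with a trivial concatenation; the zero vectors below are placeholders standing in for real features, not trained values:

```python
import numpy as np

# concat1: 40-d cepstrum + 1-d LTP pitch + 1-d LTP gain -> 42-d input to NN part1
cepstrum_feat = np.zeros(40)
ltp_pitch = np.zeros(1)
ltp_gain = np.zeros(1)
concat1 = np.concatenate([cepstrum_feat, ltp_pitch, ltp_gain])

# concat2: 1024-d primary vector (NN part1 output) + 161-d LPC envelope
# -> 1185-d input to NN part2, whose 161-d output matches the rfft bin
# count of a 320-sample frame
primary = np.zeros(1024)
envelope = np.zeros(161)
concat2 = np.concatenate([primary, envelope])
```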
- the terminal obtains the frequency-domain speech excitation signal by performing a Fourier transform on the filter speech excitation signal, enhances its amplitude feature based on the target feature vector, and performs an inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude feature to obtain the enhanced speech excitation signal; in this way, the audio signal can be enhanced while its phase information remains unchanged, improving the quality of the audio signal.
- the linear filtering parameters include a linear filter coefficient and an energy gain value; the step in which the terminal configures the linear prediction filter based on the linear filtering parameters and performs linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter includes: configure the linear prediction filter based on the linear filter coefficient; obtain the energy gain value corresponding to the historical voice packet decoded before the current voice packet; determine the energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the current voice packet; adjust the energy of the historical long-term filtering excitation signal corresponding to the historical voice packet with the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal; and input the adjusted historical long-term filtering excitation signal and the enhanced speech excitation signal into the configured linear prediction filter, so that the filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal.
- the historical audio signal corresponding to the historical voice packet is the previous frame audio signal of the current frame audio signal corresponding to the current voice packet.
- the energy gain value corresponding to the historical voice packet may be the energy gain value corresponding to the whole-frame audio signal of the historical voice packet, or may be the energy gain value corresponding to a partial subframe of the historical voice packet's audio signal.
- when the audio signal is not a forward error correction frame signal, that is, the previous-frame audio signal was obtained by the terminal normally decoding the historical voice packet, the terminal can obtain the energy gain value produced when decoding the historical voice packet and determine the energy adjustment parameter based on it; when the audio signal is a forward error correction frame signal, that is, the previous-frame audio signal was not obtained by normally decoding the historical voice packet, the terminal determines a compensated energy gain value for the previous-frame audio signal based on a preset energy gain compensation mechanism, takes the compensated energy gain value as the energy gain value of the historical voice packet, and determines the energy adjustment parameter based on it.
- the energy adjustment parameter gain_adj of the previous-frame audio signal S(n-i) can be calculated by the following formula:
- gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
- gain(n-i) is the energy gain value of the previous-frame audio signal S(n-i)
- gain(n) is the energy gain value of the audio signal of the current frame
- formula (14) calculates the energy adjustment parameter based on the energy gain value corresponding to the whole-frame audio signal of the historical voice packet.
- the energy adjustment parameter gain_adj of the previous-frame audio signal S(n-i) can also be obtained by the following formula:
- gain_adj is the energy adjustment parameter of the previous-frame audio signal S(n-i)
- gain_m(n-i) is the energy gain value of the m-th subframe of the previous-frame audio signal S(n-i)
- gain_m(n) is the energy gain value of the m-th subframe of the current-frame audio signal
- m is the number of subframes corresponding to each audio signal
- {gain_1(n) + ... + gain_m(n)}/m is the energy gain value of the current-frame audio signal.
- formula (15) calculates the energy adjustment parameter based on the energy gain values corresponding to the subframe audio signals of the historical voice packet.
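Formulas (14) and (15) themselves are not reproduced in this excerpt. Based on the term definitions above, one plausible reading is a ratio of current-frame energy gain to previous-frame energy gain (whole-frame in (14), subframe-averaged in (15)). The sketch below implements that assumed reading and should not be taken as the patent's exact formulas:

```python
def gain_adj_whole_frame(gain_curr, gain_prev):
    """Formula (14)-style adjustment (assumed form): ratio of the
    current frame's energy gain to the previous frame's."""
    return gain_curr / gain_prev

def gain_adj_subframes(sub_gains_curr, gain_prev_last_sub):
    """Formula (15)-style adjustment (assumed form): average of the
    current frame's subframe gains {gain_1(n)+...+gain_m(n)}/m over
    the previous frame's last subframe gain."""
    m = len(sub_gains_curr)
    return (sum(sub_gains_curr) / m) / gain_prev_last_sub

# toy values only
g1 = gain_adj_whole_frame(2.0, 1.0)       # current frame twice as loud
g2 = gain_adj_subframes([1.0, 3.0], 2.0)  # average 2.0 over previous 2.0
```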
- the terminal configures the linear prediction filter based on the linear filter coefficient, obtains the energy gain value corresponding to the historical voice packet decoded before the current voice packet, determines the energy adjustment parameter based on the energy gain values corresponding to the historical voice packet and the current voice packet, adjusts the energy of the historical long-term filtering excitation signal corresponding to the historical voice packet with the energy adjustment parameter to obtain the adjusted historical long-term filtering excitation signal, and inputs the adjusted historical long-term filtering excitation signal together with the enhanced speech excitation signal into the configured linear prediction filter, so that the filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtering excitation signal; this smooths the audio signals of different frames and improves the quality of the speech composed of them.
- an audio signal enhancement method is provided. The method is described by taking its application to the computer device (terminal or server) in FIG. 2 as an example, and includes the following steps:
- S902 Decode the voice packet to obtain a residual signal, long-term filtering parameters and linear filtering parameters.
- S904 Configure the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the configured long-term prediction filter to obtain a long-term filtered excitation signal.
- S906 Divide the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals.
- S908 Group the linear filtering parameters to obtain at least two linear filtering parameter sets.
- S910 Configure at least two linear prediction filters respectively based on the linear filtering parameter sets.
- S914 Combine the sub-audio signals according to the time sequence of each subframe to obtain an audio signal.
- S916 Determine whether a data abnormality occurs in the historical voice packets decoded before the current voice packet.
- S922 Configure the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the configured linear prediction filter to obtain a filter speech excitation signal.
- S926 Configure the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter to obtain a speech enhancement signal.
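The overall flow of steps S902-S926 can be sketched as a pipeline skeleton. Every entry in `ops` is a placeholder callable standing in for the operation named in the corresponding step; none of the names are APIs from the patent.

```python
def enhance_frame(packet, ops):
    """Skeleton of steps S902-S926; all stage callables are placeholders."""
    # S902: decode the voice packet.
    residual, ltp_params, lpc_params = ops["decode"](packet)
    # S904: long-term synthesis filtering of the residual.
    excitation = ops["ltp_synthesis"](residual, ltp_params)
    # S906-S914: (sub-frame) linear synthesis filtering and recombination.
    audio = ops["lpc_synthesis"](excitation, lpc_params)
    # S916: only forward error correction frames are enhanced;
    # other frames are output directly.
    if not ops["is_fec_frame"](packet):
        return audio
    features = ops["extract_features"](audio)
    # S922: linear decomposition filtering back to an excitation signal.
    filt_exc = ops["lpc_analysis"](audio, lpc_params)
    enhanced_exc = ops["nn_enhance"](filt_exc, features, ltp_params, lpc_params)
    # S926: linear synthesis filtering of the enhanced excitation.
    return ops["lpc_synthesis"](enhanced_exc, lpc_params)
```

With trivial identity stages, a non-FEC frame passes through unchanged, while an FEC frame is routed through the enhancement branch.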
- the present application also provides an application scenario where the above-mentioned audio signal enhancement method is applied.
- the application of the audio signal enhancement method in this application scenario is as follows:
- after receiving a voice packet corresponding to one frame of audio signal, the terminal performs entropy decoding on the voice packet to obtain the residual signal, the LTP pitch, the LTP gain, the LPC AR coefficients and the LPC gain; performs LTP synthesis filtering on the residual signal based on the LTP pitch and LTP gain to obtain E(n); performs LPC synthesis filtering on each subframe of E(n) based on the LPC AR coefficients and LPC gain, and combines the LPC synthesis filtering results into one frame S(n); then performs cepstral analysis on S(n) to obtain C(n), and performs LPC decomposition filtering on the entire frame S(n) based on the LPC AR coefficients and LPC gain to obtain the entire frame D(n). The LTP pitch, the LTP gain, the envelope information obtained from the Fourier transform of the LPC AR coefficients, C(n) and D(n) are input into the pre-trained signal enhancement model (NN postfilter), and the whole frame D(n) is enhanced by the NN postfilter to obtain the enhanced speech excitation signal.
- although the steps in FIGS. 3, 4, 6, 9 and 10 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that sequence. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 3, 4, 6, 9 and 10 may include multiple sub-steps or stages, which are not necessarily executed and completed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
- an audio signal enhancement apparatus is provided. The apparatus may be implemented as part of a computer device using software modules, hardware modules, or a combination of the two. The apparatus specifically includes: a voice packet processing module 1102, a feature parameter extraction module 1104, a signal conversion module 1106, a speech enhancement module 1108 and a speech synthesis module 1110, wherein:
- the voice packet processing module 1102 is configured to sequentially decode the received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and to filter the residual signal to obtain an audio signal.
- the feature parameter extraction module 1104 is configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal.
- the signal conversion module 1106 is configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters.
- the speech enhancement module 1108 is configured to perform speech enhancement processing on the filter speech excitation signal according to the characteristic parameter, the long-term filter parameter and the linear filter parameter to obtain an enhanced speech excitation signal.
- the speech synthesis module 1110 is configured to perform speech synthesis based on the enhanced speech excitation signal and linear filtering parameters to obtain a speech enhancement signal.
- the computer device sequentially decodes the received voice packets to obtain the residual signal, the long-term filtering parameters and the linear filtering parameters, and filters the residual signal to obtain the audio signal. When the audio signal is a forward error correction frame signal, feature parameters are extracted from the audio signal, and the audio signal is converted into a filter speech excitation signal based on the linear filtering coefficients obtained by decoding the voice packet, so that speech enhancement can then be performed according to the feature parameters and the long-term filtering parameters obtained from the decoded voice packet.
- the speech packet processing module 1102 is further configured to: configure the long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the configured long-term prediction filter to obtain the long-term filtered excitation signal; and configure the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtered excitation signal through the configured linear prediction filter to obtain the audio signal.
- the terminal performs long-term synthesis filtering on the residual signal based on the long-term filtering parameters to obtain the long-term filtered excitation signal, and performs linear synthesis filtering on the long-term filtered excitation signal based on the decoded linear filtering parameters to obtain the audio signal. Therefore, when the audio signal is not a forward error correction frame signal, it can be output directly, and when it is a forward error correction frame signal, it is enhanced before output, which improves the timeliness of the audio signal output.
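The two-stage synthesis described above (long-term synthesis filtering followed by linear synthesis filtering) can be sketched as below. A one-tap LTP filter and the common sign convention A(z) = 1 + Σ a_k z^(-k) are assumptions; the patent does not fix the filter orders or tap layouts in this excerpt.

```python
import numpy as np

def ltp_synthesis(residual, pitch, ltp_gain):
    """Long-term synthesis filtering, sketched as a one-tap LTP filter:
    e[n] = r[n] + g * e[n - pitch] (the actual tap layout is an
    assumption)."""
    e = np.zeros(len(residual))
    for n in range(len(residual)):
        past = e[n - pitch] if n >= pitch else 0.0
        e[n] = residual[n] + ltp_gain * past
    return e

def lpc_synthesis(excitation, lpc_coeffs, gain=1.0):
    """Linear synthesis filtering: s[n] = g*e[n] - sum_k a_k * s[n-k],
    assuming A(z) = 1 + sum_k a_k z^-k (the sign convention is an
    assumption)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= a * s[n - k]
        s[n] = acc
    return s
```

Feeding the residual through `ltp_synthesis` and its output through `lpc_synthesis` reproduces the decode-side cascade the text describes.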
- the voice packet processing module 1102 is further configured to: divide the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals; group the linear filtering parameters to obtain at least two linear filtering parameter sets; configure at least two linear prediction filters respectively based on the parameter sets; input the obtained sub-long-term filtered excitation signals into the configured linear prediction filters, so that the filters perform linear synthesis filtering on the sub-long-term filtered excitation signals based on the linear filtering parameter sets to obtain the sub-audio signals corresponding to each subframe; and combine the sub-audio signals according to the timing of the subframes to obtain the audio signal.
- the terminal divides the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals; groups the linear filtering parameters to obtain at least two linear filtering parameter sets; configures at least two linear prediction filters based on the parameter sets; and inputs the obtained sub-long-term filtered excitation signals into the configured linear prediction filters, so that each filter performs linear synthesis filtering on its sub-long-term filtered excitation signal based on its linear filtering parameter set to obtain the sub-audio signal corresponding to each subframe. The sub-audio signals are combined according to the timing of the subframes to obtain the audio signal, which ensures that the obtained audio signal better restores the audio signal of the sending end and improves the quality of the restored audio signal.
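A minimal sketch of the subframe handling above, assuming equal-length subframes and equal-sized parameter groups (the text only requires "at least two" of each); the helper names are invented for illustration:

```python
def split_subframes(signal, num_sub):
    """Divide the long-term filtered excitation signal into num_sub
    equal-length subframes (equal lengths are an assumption)."""
    n = len(signal) // num_sub
    return [signal[i * n:(i + 1) * n] for i in range(num_sub)]

def group_lpc_params(lpc_params, num_groups):
    """Group the linear filtering parameters into one set per subframe
    filter (equal-sized groups are an assumption)."""
    n = len(lpc_params) // num_groups
    return [lpc_params[i * n:(i + 1) * n] for i in range(num_groups)]

def synthesize_frame(sub_excitations, param_sets, lpc_synthesis):
    """One configured linear prediction filter per subframe; the
    sub-audio signals are recombined in subframe time order."""
    subs = [lpc_synthesis(e, a) for e, a in zip(sub_excitations, param_sets)]
    return [x for sub in subs for x in sub]
```

`lpc_synthesis` is passed in as a callable so the sketch stays independent of any particular filter implementation.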
- the linear filtering parameters include linear filtering coefficients and an energy gain value; the voice packet processing module 1102 is further configured to: obtain the sub-long-term filtered excitation signal corresponding to the first subframe in the long-term filtered excitation signal, and obtain the energy gain value of the historical sub-long-term filtered excitation signal of the subframe, in the historical long-term filtered excitation signal, adjacent to the sub-long-term filtered excitation signal corresponding to the first subframe; determine the energy adjustment parameter corresponding to the sub-long-term filtered excitation signal based on the energy gain value corresponding to the historical sub-long-term filtered excitation signal and the energy gain value of the sub-long-term filtered excitation signal corresponding to the first subframe; perform energy adjustment on the historical sub-long-term filtered excitation signal through the energy adjustment parameter; and input the obtained sub-long-term filtered excitation signal and the energy-adjusted historical sub-long-term filtered excitation signal into the configured linear prediction filter.
- for the sub-long-term filtered excitation signal corresponding to the first subframe in the long-term filtered excitation signal, the terminal obtains the historical sub-long-term filtered excitation signal of the adjacent subframe in the historical long-term filtered excitation signal, so that the energy at the boundary between adjacent frames can be adjusted consistently.
- the apparatus further includes a data abnormality determination module 1112 and a forward error correction frame signal determination module 1114, wherein: the data abnormality determination module 1112 is configured to determine whether a data abnormality occurs in the historical voice packets decoded before the current voice packet; and the forward error correction frame signal determination module 1114 is configured to determine that the audio signal obtained by decoding and filtering is a forward error correction frame signal if a data abnormality occurs in the historical voice packets.
- the terminal determines whether a data abnormality occurs in the historical voice packets decoded before the current voice packet, thereby determining whether the current audio signal obtained by decoding and filtering is a forward error correction frame signal. When the audio signal is a forward error correction frame signal, audio signal enhancement processing is performed on it to further improve the quality of the audio signal.
- the feature parameters include cepstral feature parameters; the feature parameter extraction module 1104 is further configured to: perform Fourier transform on the audio signal to obtain a Fourier-transformed audio signal; perform logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; and perform inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters.
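The three cepstral steps above (Fourier transform, logarithmic processing, inverse Fourier transform) map directly onto a real-cepstrum computation. Taking the log of the *magnitude* spectrum, with a small epsilon to guard log(0), is an assumption, since the excerpt says only "logarithmic processing":

```python
import numpy as np

def cepstral_features(audio, eps=1e-12):
    """Real-cepstrum sketch of the three steps described in the text."""
    spectrum = np.fft.rfft(audio)                # Fourier transform
    log_mag = np.log(np.abs(spectrum) + eps)     # logarithmic processing
    return np.fft.irfft(log_mag, n=len(audio))   # inverse Fourier transform
```

The result has the same length as the input frame and is real-valued, which is what a downstream feature vector would expect.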
- by extracting the cepstral feature parameters from the audio signal, the terminal can enhance the audio signal based on the extracted cepstral feature parameters, thereby improving the quality of the audio signal.
- the long-term filtering parameters include a pitch period and an amplitude gain value; the speech enhancement module 1108 is further configured to: perform speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters and the cepstral feature parameters to obtain the enhanced speech excitation signal.
- the terminal performs speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters and the cepstral feature parameters to obtain the enhanced speech excitation signal, based on which the enhancement of the audio signal can then be realized, improving the quality of the audio signal.
- the signal conversion module 1106 is further configured to: configure the linear prediction filter based on the linear filtering parameters, and perform linear decomposition filtering on the audio signal through the configured linear prediction filter to obtain the filter speech excitation signal.
- the terminal converts the audio signal into the filter speech excitation signal based on the linear filtering parameters, so that the audio signal can be enhanced by enhancing the filter speech excitation signal, thereby improving the quality of the audio signal.
- the speech enhancement module 1108 is further configured to: input the feature parameters, the long-term filtering parameters, the linear filtering parameters and the filter speech excitation signal into the pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
- the terminal enhances the filter speech excitation signal through the pre-trained signal enhancement model, and can then enhance the audio signal based on the enhanced speech excitation signal, which improves both the quality of the audio signal and the efficiency of the audio signal enhancement processing.
- the feature parameters include cepstral feature parameters; the speech enhancement module 1108 is further configured to: perform vectorization on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters, and concatenate the vectorization results to obtain a feature vector; input the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; perform feature extraction on the feature vector through the signal enhancement model to obtain a target feature vector; and enhance the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
- the terminal obtains a feature vector by vectorizing the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters and concatenating the results; the signal enhancement model extracts features from the feature vector to obtain a target feature vector, and enhances the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal. In this way, the signal enhancement model enables the enhancement processing of the audio signal, improving both the quality of the audio signal and the efficiency of the enhancement processing.
- the speech enhancement module 1108 is further configured to: perform Fourier transform on the filter speech excitation signal to obtain a frequency-domain speech excitation signal; enhance the amplitude feature of the frequency-domain speech excitation signal based on the target feature vector; and perform inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude feature to obtain the enhanced speech excitation signal.
- the terminal obtains the frequency-domain speech excitation signal by performing Fourier transform on the filter speech excitation signal, enhances the amplitude feature of the frequency-domain speech excitation signal based on the target feature vector, and performs inverse Fourier transform on the result to obtain the enhanced speech excitation signal. The audio signal can thus be enhanced while its phase information is kept unchanged, improving the quality of the audio signal.
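The phase-preserving amplitude enhancement above can be sketched as follows. The per-bin gain vector `mag_gains` stands in for whatever the signal enhancement model derives from the target feature vector, which is not specified in this excerpt:

```python
import numpy as np

def enhance_magnitude(excitation, mag_gains):
    """Enhance only the amplitude feature of the frequency-domain speech
    excitation signal while keeping the phase unchanged, then transform
    back to the time domain."""
    spec = np.fft.rfft(excitation)
    phase = np.angle(spec)                           # phase kept as-is
    enhanced = mag_gains * np.abs(spec) * np.exp(1j * phase)
    return np.fft.irfft(enhanced, n=len(excitation))
```

With unity gains the signal is reconstructed exactly, which confirms that only the magnitude path is modified.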
- the speech synthesis module 1110 is further configured to: configure the linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter to obtain the speech enhancement signal.
- the terminal can obtain a speech enhancement signal by performing linear synthesis filtering on the enhanced speech excitation signal, that is, the enhancement processing of the audio signal is realized, and the quality of the audio signal is improved.
- the linear filtering parameters include linear filtering coefficients and energy gain values; the speech synthesis module 1110 is further configured to: configure the linear prediction filter based on the linear filtering coefficients; obtain the energy gain value corresponding to the historical voice packet decoded before the current voice packet; determine the energy adjustment parameter based on the energy gain values corresponding to the historical voice packet and the current voice packet; perform energy adjustment on the historical long-term filtered excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtered excitation signal; and input the adjusted historical long-term filtered excitation signal and the enhanced speech excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtered excitation signal.
- the terminal configures the linear prediction filter based on the linear filtering coefficients, obtains the energy gain value corresponding to the historical voice packet decoded before the current voice packet, determines the energy adjustment parameter based on the energy gain values corresponding to the historical voice packet and the current voice packet, performs energy adjustment on the historical long-term filtered excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain the adjusted historical long-term filtered excitation signal, and inputs the adjusted historical long-term filtered excitation signal and the enhanced speech excitation signal into the configured linear prediction filter, so that the filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtered excitation signal. This smooths the audio signals of different frames and improves the quality of the speech composed of audio signals of different frames.
- Each module in the above-mentioned audio signal enhancement apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
- the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
- a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 13 .
- the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium, an internal memory.
- the nonvolatile storage medium stores an operating system, a computer program, and a database.
- the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
- the computer device's database is used to store voice packet data.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer program when executed by a processor implements an audio signal enhancement method.
- a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 14 .
- the computer equipment includes a processor, memory, a communication interface, a display screen, and an input device connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium, an internal memory.
- the nonvolatile storage medium stores an operating system and a computer program.
- the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
- the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
- the computer program when executed by a processor implements an audio signal enhancement method.
- the display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a button, trackball or touchpad on the housing of the computer device, or an external keyboard, trackpad or mouse.
- FIG. 13 or FIG. 14 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
- a computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.
- a computer device including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the foregoing method embodiments when the processor executes the computer program.
- a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, implements the steps in the foregoing method embodiments.
- a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
- the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
- Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
- Volatile memory may include random access memory (RAM) or external cache memory.
- the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
Claims (20)
- An audio signal enhancement method, executed by a computer device, the method comprising: sequentially decoding received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters; filtering the residual signal to obtain an audio signal; when the audio signal is a forward error correction frame signal, extracting feature parameters from the audio signal; converting the audio signal into a filter speech excitation signal based on the linear filtering parameters; performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- The method according to claim 1, wherein the filtering the residual signal to obtain an audio signal comprises: configuring a long-term prediction filter based on the long-term filtering parameters, and performing long-term synthesis filtering on the residual signal through the configured long-term prediction filter to obtain a long-term filtered excitation signal; and configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the long-term filtered excitation signal through the configured linear prediction filter to obtain the audio signal.
- The method according to claim 2, wherein the performing linear synthesis filtering on the long-term filtered excitation signal through the configured linear prediction filter to obtain the audio signal comprises: dividing the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals; grouping the linear filtering parameters to obtain at least two linear filtering parameter sets; configuring at least two linear prediction filters respectively based on the linear filtering parameter sets; inputting the obtained sub-long-term filtered excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtered excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes; and combining the sub-audio signals according to the time sequence of the subframes to obtain the audio signal.
- The method according to claim 3, wherein the linear filtering parameters include linear filtering coefficients and energy gain values, and the method further comprises: for the sub-long-term filtered excitation signal corresponding to the first subframe in the long-term filtered excitation signal, obtaining the energy gain value of the historical sub-long-term filtered excitation signal of the subframe, in the historical long-term filtered excitation signal, adjacent to the sub-long-term filtered excitation signal corresponding to the first subframe; determining the energy adjustment parameter corresponding to the sub-long-term filtered excitation signal based on the energy gain value corresponding to the historical sub-long-term filtered excitation signal and the energy gain value of the sub-long-term filtered excitation signal corresponding to the first subframe; and performing energy adjustment on the historical sub-long-term filtered excitation signal through the energy adjustment parameter; wherein the inputting the obtained sub-long-term filtered excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtered excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes, comprises: inputting the obtained sub-long-term filtered excitation signal and the energy-adjusted historical sub-long-term filtered excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtered excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtered excitation signal, to obtain the sub-audio signal corresponding to the first subframe.
- The method according to claim 1, further comprising: determining whether a data abnormality occurs in historical voice packets decoded before the voice packet is decoded; and if a data abnormality occurs in the historical voice packets, determining that the audio signal obtained through decoding and filtering is a forward error correction frame signal.
- The method according to claim 1, wherein the feature parameters include cepstral feature parameters, and the extracting feature parameters from the audio signal comprises: performing Fourier transform on the audio signal to obtain a Fourier-transformed audio signal; performing logarithmic processing on the Fourier-transformed audio signal to obtain a logarithmic result; and performing inverse Fourier transform on the logarithmic result to obtain the cepstral feature parameters.
- The method according to claim 6, wherein the long-term filtering parameters include a pitch period and an amplitude gain value, and the performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal comprises: performing speech enhancement processing on the filter speech excitation signal according to the pitch period, the amplitude gain value, the linear filtering parameters and the cepstral feature parameters to obtain the enhanced speech excitation signal.
- The method according to claim 1, wherein the converting the audio signal into a filter speech excitation signal based on the linear filtering parameters comprises: configuring a linear prediction filter based on the linear filtering parameters, and performing linear decomposition filtering on the audio signal through the configured linear prediction filter to obtain the filter speech excitation signal.
- The method according to claim 1, wherein the performing speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal comprises: inputting the feature parameters, the long-term filtering parameters, the linear filtering parameters and the filter speech excitation signal into a pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal.
- The method according to claim 9, wherein the feature parameters include cepstral feature parameters, and the inputting the feature parameters, the long-term filtering parameters, the linear filtering parameters and the filter speech excitation signal into a pre-trained signal enhancement model, so that the signal enhancement model performs speech enhancement processing on the filter speech excitation signal based on the feature parameters to obtain the enhanced speech excitation signal, comprises: performing vectorization on the cepstral feature parameters, the long-term filtering parameters and the linear filtering parameters, and concatenating the vectorization results to obtain a feature vector; inputting the feature vector and the filter speech excitation signal into the pre-trained signal enhancement model; performing feature extraction on the feature vector through the signal enhancement model to obtain a target feature vector; and performing enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal.
- The method according to claim 10, wherein the performing enhancement processing on the filter speech excitation signal based on the target feature vector to obtain the enhanced speech excitation signal comprises: performing Fourier transform on the filter speech excitation signal to obtain a frequency-domain speech excitation signal; enhancing an amplitude feature of the frequency-domain speech excitation signal based on the target feature vector; and performing inverse Fourier transform on the frequency-domain speech excitation signal with the enhanced amplitude feature to obtain the enhanced speech excitation signal.
- The method according to claim 1, wherein the performing speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal comprises: configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter to obtain the speech enhancement signal.
- The method according to claim 12, wherein the linear filtering parameters include linear filtering coefficients and energy gain values, and the configuring a linear prediction filter based on the linear filtering parameters, and performing linear synthesis filtering on the enhanced speech excitation signal through the configured linear prediction filter, comprises: configuring the linear prediction filter based on the linear filtering coefficients; obtaining the energy gain value corresponding to a historical voice packet decoded before the voice packet is decoded; determining an energy adjustment parameter based on the energy gain value corresponding to the historical voice packet and the energy gain value corresponding to the voice packet; performing energy adjustment on the historical long-term filtered excitation signal corresponding to the historical voice packet through the energy adjustment parameter to obtain an adjusted historical long-term filtered excitation signal; and inputting the adjusted historical long-term filtered excitation signal and the enhanced speech excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the enhanced speech excitation signal based on the adjusted historical long-term filtered excitation signal.
- An audio signal enhancement apparatus, comprising: a voice packet processing module, configured to sequentially decode received voice packets to obtain a residual signal, long-term filtering parameters and linear filtering parameters, and to filter the residual signal to obtain an audio signal; a feature parameter extraction module, configured to extract feature parameters from the audio signal when the audio signal is a forward error correction frame signal; a signal conversion module, configured to convert the audio signal into a filter speech excitation signal based on the linear filtering parameters; a speech enhancement module, configured to perform speech enhancement processing on the filter speech excitation signal according to the feature parameters, the long-term filtering parameters and the linear filtering parameters to obtain an enhanced speech excitation signal; and a speech synthesis module, configured to perform speech synthesis based on the enhanced speech excitation signal and the linear filtering parameters to obtain a speech enhancement signal.
- The apparatus according to claim 14, wherein the voice packet processing module is further configured to: configure a long-term prediction filter based on the long-term filtering parameters, and perform long-term synthesis filtering on the residual signal through the configured long-term prediction filter to obtain a long-term filtered excitation signal; and configure a linear prediction filter based on the linear filtering parameters, and perform linear synthesis filtering on the long-term filtered excitation signal through the configured linear prediction filter to obtain the audio signal.
- The apparatus according to claim 15, wherein the voice packet processing module is further configured to: divide the long-term filtered excitation signal into at least two subframes to obtain sub-long-term filtered excitation signals; group the linear filtering parameters to obtain at least two linear filtering parameter sets; configure at least two linear prediction filters respectively based on the linear filtering parameter sets; input the obtained sub-long-term filtered excitation signals respectively into the configured linear prediction filters, so that the linear prediction filters perform linear synthesis filtering on the sub-long-term filtered excitation signals based on the linear filtering parameter sets to obtain sub-audio signals corresponding to the subframes; and combine the sub-audio signals according to the time sequence of the subframes to obtain the audio signal.
- The apparatus according to claim 16, wherein the linear filtering parameters include linear filtering coefficients and energy gain values, and the voice packet processing module is further configured to: for the sub-long-term filtered excitation signal corresponding to the first subframe in the long-term filtered excitation signal, obtain the energy gain value of the historical sub-long-term filtered excitation signal of the subframe, in the historical long-term filtered excitation signal, adjacent to the sub-long-term filtered excitation signal corresponding to the first subframe; determine the energy adjustment parameter corresponding to the sub-long-term filtered excitation signal based on the energy gain value corresponding to the historical sub-long-term filtered excitation signal and the energy gain value of the sub-long-term filtered excitation signal corresponding to the first subframe; perform energy adjustment on the historical sub-long-term filtered excitation signal through the energy adjustment parameter; and input the obtained sub-long-term filtered excitation signal and the energy-adjusted historical sub-long-term filtered excitation signal into the configured linear prediction filter, so that the linear prediction filter performs linear synthesis filtering on the sub-long-term filtered excitation signal corresponding to the first subframe based on the linear filtering coefficients and the energy-adjusted historical sub-long-term filtered excitation signal, to obtain the sub-audio signal corresponding to the first subframe.
- A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 13.
- A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023535590A JP2023553629A (ja) | 2021-04-30 | 2022-04-15 | オーディオ信号強化方法、装置、コンピュータ機器及びコンピュータプログラム |
EP22794615.9A EP4297025A1 (en) | 2021-04-30 | 2022-04-15 | Audio signal enhancement method and apparatus, computer device, storage medium, and computer program product |
US18/076,116 US20230099343A1 (en) | 2021-04-30 | 2022-12-06 | Audio signal enhancement method and apparatus, computer device, storage medium and computer program product |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110484196.6 | 2021-04-30 | ||
CN202110484196.6A CN113763973A (zh) | 2021-04-30 | 2021-04-30 | 音频信号增强方法、装置、计算机设备和存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/076,116 Continuation US20230099343A1 (en) | 2021-04-30 | 2022-12-06 | Audio signal enhancement method and apparatus, computer device, storage medium and computer program product |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022228144A1 true WO2022228144A1 (zh) | 2022-11-03 |
Family
ID=78786944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/086960 WO2022228144A1 (zh) | 2021-04-30 | 2022-04-15 | 音频信号增强方法、装置、计算机设备、存储介质和计算机程序产品 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20230099343A1 (zh) |
EP (1) | EP4297025A1 (zh) |
JP (1) | JP2023553629A (zh) |
CN (1) | CN113763973A (zh) |
WO (1) | WO2022228144A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763973A (zh) * | 2021-04-30 | 2021-12-07 | 腾讯科技(深圳)有限公司 | 音频信号增强方法、装置、计算机设备和存储介质 |
CN113938749B (zh) * | 2021-11-30 | 2023-05-05 | 北京百度网讯科技有限公司 | 音频数据处理方法、装置、电子设备和存储介质 |
CN116994587B (zh) * | 2023-09-26 | 2023-12-08 | 成都航空职业技术学院 | 一种培训监管系统 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714820A (zh) * | 2013-12-27 | 2014-04-09 | 广州华多网络科技有限公司 | 参数域的丢包隐藏方法及装置 |
CN105765651A (zh) * | 2013-10-31 | 2016-07-13 | 弗朗霍夫应用科学研究促进协会 | 用于使用基于时域激励信号的错误隐藏提供经解码的音频信息的音频解码器及方法 |
CN107248411A (zh) * | 2016-03-29 | 2017-10-13 | 华为技术有限公司 | 丢帧补偿处理方法和装置 |
CN111554308A (zh) * | 2020-05-15 | 2020-08-18 | 腾讯科技(深圳)有限公司 | 一种语音处理方法、装置、设备及存储介质 |
CN112489665A (zh) * | 2020-11-11 | 2021-03-12 | 北京融讯科创技术有限公司 | 语音处理方法、装置以及电子设备 |
WO2021050155A1 (en) * | 2019-09-09 | 2021-03-18 | Qualcomm Incorporated | Artificial intelligence based audio coding |
CN113763973A (zh) * | 2021-04-30 | 2021-12-07 | 腾讯科技(深圳)有限公司 | 音频信号增强方法、装置、计算机设备和存储介质 |
- 2021
  - 2021-04-30 CN CN202110484196.6A patent/CN113763973A/zh active Pending
- 2022
  - 2022-04-15 EP EP22794615.9A patent/EP4297025A1/en active Pending
  - 2022-04-15 JP JP2023535590A patent/JP2023553629A/ja active Pending
  - 2022-04-15 WO PCT/CN2022/086960 patent/WO2022228144A1/zh active Application Filing
  - 2022-12-06 US US18/076,116 patent/US20230099343A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2023553629A (ja) | 2023-12-25 |
CN113763973A (zh) | 2021-12-07 |
US20230099343A1 (en) | 2023-03-30 |
EP4297025A1 (en) | 2023-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022228144A1 (zh) | Audio signal enhancement method and apparatus, computer device, storage medium and computer program product | |
CN111247585B (zh) | Voice conversion method, apparatus, device and storage medium | |
KR101246991B1 (ko) | Audio signal processing method | |
TWI484479B (zh) | Apparatus and method for error concealment in low-delay unified speech and audio coding | |
TW200401532A (en) | Distributed voice recognition system utilizing multistream network feature processing | |
US11594236B2 (en) | Audio encoding/decoding based on an efficient representation of auto-regressive coefficients | |
WO2009055192A1 (en) | Method and apparatus for generating an enhancement layer within an audio coding system | |
WO2010077542A1 (en) | Method and apprataus for generating an enhancement layer within a multiple-channel audio coding system | |
US20220180881A1 (en) | Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium | |
WO2021227749A1 (zh) | Speech processing method and apparatus, electronic device, and computer-readable storage medium | |
US8762141B2 (en) | Reduced-complexity vector indexing and de-indexing | |
US20230377584A1 (en) | Real-time packet loss concealment using deep generative networks | |
JP3357795B2 (ja) | Speech coding method and apparatus | |
CN111554322A (zh) | Speech processing method, apparatus, device and storage medium | |
CN112908293B (zh) | Polyphone pronunciation error correction method and device based on a semantic attention mechanism | |
CN111554323A (zh) | Speech processing method, apparatus, device and storage medium | |
CN111554308A (zh) | Speech processing method, apparatus, device and storage medium | |
JP2024516664A (ja) | Decoder | |
Li et al. | A Two-stage Approach to Quality Restoration of Bone-conducted Speech | |
Benamirouche et al. | Low complexity forward error correction for CELP-type speech coding over erasure channel transmission | |
KR102132326B1 (ko) | 통신 시스템에서 오류 은닉 방법 및 장치 | |
Huang et al. | A Two-Stage Training Framework for Joint Speech Compression and Enhancement | |
Lan et al. | Shortcut-Based Fully Convolutional Network for Speech Enhancement | |
EP3252763A1 (en) | Low-delay audio coding | |
JPH11272298A (ja) | Voice communication method and voice communication apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 22794615 Country of ref document: EP Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase |
Ref document number: 2023535590 Country of ref document: JP |
WWE | Wipo information: entry into national phase |
Ref document number: 2022794615 Country of ref document: EP |
ENP | Entry into the national phase |
Ref document number: 2022794615 Country of ref document: EP Effective date: 20230920 |
NENP | Non-entry into the national phase |
Ref country code: DE |