WO2022012195A1 - Audio signal processing method and related apparatus - Google Patents

Audio signal processing method and related apparatus

Info

Publication number
WO2022012195A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
signal
frequency domain
audio
domain signal
Prior art date
Application number
PCT/CN2021/097663
Other languages
English (en)
Chinese (zh)
Inventor
梁俊斌
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2022012195A1 publication Critical patent/WO2022012195A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • the embodiments of the present application relate to the technical field of artificial intelligence, and in particular, to audio signal processing.
  • the voice receiving device needs to perform voice enhancement processing on the received voice signal.
  • the algorithms for voice enhancement processing may differ across voice receiving devices from different manufacturers, with different kernel versions, and running different application software.
  • as a result, the speech signal after speech enhancement processing may suffer varying degrees of speech impairment.
  • in the related art, a compensation strategy is set based on measured distortion results, and spectrum gain compensation is then performed, through the set compensation strategy, on equipment matching the corresponding device model and software version.
  • an embodiment of the present application provides an audio signal processing method, the method includes:
  • the first audio signal is processed through a spectrum compensation model to obtain a prediction result of predictive compensation for the distorted spectrum in the first audio signal;
  • the spectrum compensation model is a neural network model obtained by training on spectrally distorted audio samples and the original audio samples corresponding to those samples;
  • the first audio signal is reconstructed to obtain a target audio signal after repairing the distorted spectrum in the first audio signal.
  • an embodiment of the present application provides an audio signal processing apparatus, and the apparatus includes:
  • a signal acquisition module for acquiring the first audio signal
  • a result acquisition module configured to process the first audio signal through a spectrum compensation model to obtain a prediction result of performing prediction compensation on the distorted spectrum in the first audio signal;
  • the spectrum compensation model is a neural network model obtained by training on the spectrally distorted audio samples and the original audio samples corresponding to the spectrally distorted audio samples;
  • a target acquisition module configured to reconstruct the first audio signal according to the prediction result, and obtain a target audio signal after repairing the distorted spectrum in the first audio signal.
  • an embodiment of the present application provides a computer device, the computer device includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set, which are loaded and executed by the processor to implement the audio signal processing method of the above aspect.
  • an embodiment of the present application provides a computer-readable storage medium, where the storage medium is used for storing a computer program, and the computer program is used for executing the audio signal processing method in the above aspect.
  • a computer program product or computer program where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, and causes the computer device to perform the audio signal processing method of the above-mentioned aspect.
  • FIG. 1 is a framework diagram of model training and prediction compensation according to an exemplary embodiment
  • FIG. 2 is a model architecture diagram of a machine learning model according to an exemplary embodiment
  • FIG. 3 is a schematic diagram of an audio signal processing method according to an exemplary embodiment
  • FIG. 4 is a schematic diagram of an audio signal processing method according to an exemplary embodiment
  • FIG. 5 is an architectural diagram of an audio signal processing method according to an exemplary embodiment
  • FIG. 6 is a schematic diagram of a voice restoration applied in a voice call system according to an exemplary embodiment
  • FIG. 7 is a schematic diagram of another voice restoration applied in a voice call system according to an exemplary embodiment
  • FIG. 8 is a structural block diagram of an audio signal processing method according to an exemplary embodiment
  • Fig. 9 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • the spectral compensation model is obtained by training the spectrally distorted audio samples and the original audio samples corresponding to the spectrally distorted audio samples.
  • Speech enhancement processing: in the application scenario of a voice call, the recording signal collected by the device needs to undergo voice enhancement processing, which mainly includes echo cancellation, noise suppression, automatic volume adjustment, frequency response equalization, and the like.
  • the voice enhancement processing can be implemented through hardware, for example, voice enhancement processing can be implemented through some audio chips, or a respective voice enhancement processing function module can be added at the application layer.
  • the embodiments of the present application involve the application of the speech enhancement processing technology, for example, the first audio signal is obtained through the speech enhancement processing technology.
  • the type of damage can include high-frequency damage.
  • the high-frequency damage can be damage to the voice signal caused by the noise reduction or echo cancellation algorithm; it manifests as a significant weakening of the original high-frequency information in the voice signal, which makes the corresponding sound dull and unclear;
  • the type of damage may also include frequency band damage, in which signals in certain fixed frequency bands are fixedly attenuated, possibly because of poor equalization processing, resulting in obvious distortion of the sound corresponding to the speech signal.
  • FIG. 1 is a framework diagram of model training and prediction compensation according to an exemplary embodiment.
  • the model training device 110 trains an end-to-end machine learning model through a pre-prepared sample set including original audio samples and their corresponding spectrally distorted audio samples.
  • in the prediction compensation stage, the prediction device 120 directly predicts, according to the trained machine learning model and the input first audio signal, a prediction result of performing prediction compensation on the distorted spectrum in the first audio signal.
  • the above-mentioned model training device 110 and prediction device 120 may be computer devices with machine learning capabilities.
  • the computer devices may be stationary computer devices such as personal computers and servers, or mobile computer devices such as tablet computers and e-book readers.
  • the model training device 110 and the prediction device 120 may be the same device, or the model training device 110 and the prediction device 120 may also be different devices.
  • the model training device 110 and the prediction device 120 may be the same type of device, for example, the model training device 110 and the prediction device 120 may both be personal computers;
  • the training device 110 and the prediction device 120 may also be different types of devices, for example, the model training device 110 may be a server, and the prediction device 120 may be a mobile terminal device or the like.
  • the embodiments of the present application do not limit the specific types of the model training device 110 and the prediction device 120 .
  • Fig. 2 is a model architecture diagram of a machine learning model according to an exemplary embodiment.
  • the machine learning model 20 in this embodiment of the present application may include a sample set part, used to generate and store samples, and a spectral compensation model 210 part. The sample set stores original audio samples obtained through collection and their corresponding spectrally distorted audio samples, or original audio samples and corresponding spectrally distorted audio samples generated by artificial construction.
  • Each spectrally distorted audio sample stored in the sample set is input into the spectral compensation model 210, and the corresponding original audio sample is used as the output target to perform model training on the spectral compensation model 210.
  • after training, the spectral compensation model 210 is capable of receiving a first audio signal and outputting a prediction result, that is, the predicted spectrum-restored audio signal corresponding to the first audio signal.
  • FIG. 3 is a schematic diagram of an audio signal processing method according to an exemplary embodiment, and the audio signal processing method can be executed by an audio processing device.
  • the above audio processing device may be the prediction device 120 in the system shown in FIG. 1 above.
  • the audio signal processing method may include the following steps:
  • Step 301 acquiring a first audio signal.
  • the first audio signal may be an original audio signal collected by an audio processing device, an audio signal obtained after voice enhancement processing in the inherent hardware or software of the audio processing device, or an audio signal collected by the audio processing device that has already undergone speech enhancement processing.
  • the first audio signal will be distorted in different situations due to speech enhancement processing performed by different algorithms.
  • the algorithms used by the audio processing device to perform speech enhancement processing through inherent hardware or software may vary according to the manufacturer of the audio processing device, the kernel version, and the type of application software, so that the obtained first audio signals exhibit different distortions.
  • the voice enhancement processing can include echo cancellation processing, noise suppression processing, volume self-adjustment processing, and frequency response equalization processing.
  • audio processing equipment from different manufacturers, with different kernel versions, or running different types of application software may emphasize each aspect of voice enhancement processing differently, so the distortion in the first audio signal obtained after the speech enhancement processing also differs.
  • the distortion in the first audio signal can be expressed as high-frequency damage
  • the high-frequency damage may be caused by noise suppression processing or echo cancellation processing: the original high-frequency information in the audio signal is significantly weakened, so the sound becomes dull and unclear. The distortion can also be frequency band damage, in which some fixed frequency band signals are fixedly attenuated, possibly because the frequency response equalization was handled poorly, resulting in obvious distortion of the audio.
  • Step 302: process the first audio signal by using a spectrum compensation model to obtain a prediction result of performing prediction compensation on the distorted spectrum in the first audio signal; the spectrum compensation model is a neural network model trained on spectrally distorted audio samples and the original audio samples corresponding to them.
  • the first audio signal is input into the spectrum compensation model, which outputs, as the prediction result, a prediction of what the first audio signal would be without distortion.
  • the spectrum compensation model is a neural network model obtained by updating relevant parameters by training the spectrum-distorted audio samples and the original audio samples corresponding to the spectrum-distorted audio samples.
  • Step 303 reconstruct the first audio signal according to the prediction result, and obtain a target audio signal after repairing the distorted spectrum in the first audio signal.
  • the first audio signal can be reconstructed through the prediction result of the first audio signal output by the neural network model, and an audio signal after the first audio signal has been repaired is generated as the target audio signal.
  • the target audio signal is used as the audio signal after repairing the first audio signal, which can solve the problem of audio distortion in practical applications.
  • in a voice call scenario, the terminal on the voice-sending side receives the user's voice signal
  • the received voice signal first undergoes voice enhancement processing; the enhanced voice signal is then reconstructed and repaired, and the repaired voice signal is sent to the terminal on the voice-receiving side and played, so that the user on the receiving side can clearly hear the voice content.
  • reconstruction and repair of audio signals can also be used to process recorded audio, to optimize audio during live broadcasts, to repair audio with damaged sound quality in music playback software, and to optimize audio in video playback software.
  • the neural network model obtained through model training can make compensation predictions uniformly across different equipment models and software versions, which solves the problem of limited application scenarios for speech signal repair and thereby improves the universality of speech signal repair.
  • FIG. 4 is a schematic diagram of an audio signal processing method according to an exemplary embodiment, and the audio signal processing method can be executed by an audio processing device.
  • the above audio processing device may be the model training device 110 and the prediction device 120 in the system shown in FIG. 1 .
  • the audio signal processing method may include two stages: a model training stage and a model application stage. The model training stage may be performed offline by the model training device 110 and may include the following steps:
  • step 401 raw audio samples are obtained.
  • the acquired original audio samples may be collected from the outside world, or obtained by artificial construction.
  • the original audio samples may be pre-stored in the model training device, or collected and stored by the model training device.
  • the prepared noise sequences of different types may include babble noise, street noise, office noise, white noise, and the like.
  • step 402 a power spectrum value suppression process is performed on a part of the frequency band in the frequency domain signal corresponding to the original audio sample to obtain a spectrally distorted audio sample corresponding to the original audio sample.
  • some frequency bands of the frequency domain signal are randomly selected for suppression processing.
  • a number of spectrally distorted audio samples corresponding to the original audio samples can be obtained by multiplying the power spectrum values of the randomly selected frequency bands by a random value less than or equal to 1.
  • the partial frequency band on the frequency domain signal can also be selected according to the actual application.
  • for example, if the original audio sample is an audio signal with an 8 kHz bandwidth,
  • the part of the signal below 2 kHz is relatively less damaged, and the part of the signal above 4 kHz is more seriously damaged, so frequency bands can be extracted according to a frequency band damage probability distribution.
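As a concrete illustration of the suppression step above, the sketch below randomly attenuates a few frequency bands of a power spectrum with numpy. The function name, band width, and band count are illustrative assumptions, not values from the patent:

```python
import numpy as np

def distort_power_spectrum(power_spec, band_width=8, n_bands=4, rng=None):
    """Create a spectrally distorted sample from an original power
    spectrum: randomly select a few frequency bands and multiply their
    power spectrum values by a random value less than or equal to 1."""
    rng = np.random.default_rng() if rng is None else rng
    distorted = power_spec.astype(float).copy()
    n_bins = len(distorted)
    for _ in range(n_bands):
        start = int(rng.integers(0, n_bins - band_width))  # random band position
        distorted[start:start + band_width] *= rng.uniform(0.0, 1.0)
    return distorted

original = np.ones(129)                  # e.g. 129 bins for an 8 kHz band
distorted = distort_power_spectrum(original, rng=np.random.default_rng(0))
```

A frequency band damage probability distribution, as suggested for the bands above 4 kHz, could be added by biasing the choice of `start` toward higher bin indices.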
  • step 403 using the spectrally distorted audio samples as input, and using the original audio samples as the training target, machine learning training is performed to obtain a spectral compensation model.
  • the original audio samples and the corresponding spectrally distorted audio samples obtained by the above-mentioned methods can be used to train a spectral compensation model and update the parameters of the model.
  • the spectral compensation model is a neural network model obtained by training the spectrally distorted audio samples and the original audio samples corresponding to the spectrally distorted audio samples.
  • the spectral compensation model is an RNN (Recurrent Neural Network) model or an LSTM (Long Short-Term Memory) network model.
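The input/target arrangement of this training can be sketched with a deliberately simplified stand-in model. The patent specifies an RNN or LSTM; the single linear layer, synthetic data, and hyperparameters below are assumptions chosen only to keep the sketch short and runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_samples = 22, 256               # e.g. 22 subband values per frame

# Synthetic sample set: "original" log power spectra and their
# "spectrally distorted" counterparts (a fixed, learnable distortion).
original = rng.normal(size=(n_samples, n_bins))
distorted = 0.5 * original - 1.0

# Stand-in spectral compensation model: one linear layer trained with
# gradient descent, input = distorted sample, target = original sample.
W = np.zeros((n_bins, n_bins))
b = np.zeros(n_bins)
lr = 0.05
for _ in range(2000):
    err = distorted @ W + b - original    # model output minus training target
    W -= lr * distorted.T @ err / n_samples
    b -= lr * err.mean(axis=0)

mse = float(np.mean((distorted @ W + b - original) ** 2))
```

After training, the stand-in has essentially learned to invert the synthetic distortion; a real spectral compensation model would learn such a mapping from pairs of recorded audio samples.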
  • logarithmic processing is required for both the spectrally distorted audio samples on the input side used for training the spectral compensation model and the original audio samples on the target output side.
  • the power spectrum value corresponding to the spectrally distorted audio sample on the input side and the power spectrum value corresponding to the original audio sample both need to be logarithmically processed; the calculation formula of the logarithmic processing can be as follows: S_dB(i, k) = 10·log10(S(i, k))
  • where S_dB(i, k) is the logarithmic value of the power spectrum, S(i, k) is the linear power spectrum value, i is the corresponding frame number, and k is the corresponding frequency index value.
  • the spectral compensation model obtained by training the sample set obtained after logarithmic processing still has logarithmic values on the input side and the output side in practical applications.
  • the model training device can complete the training and update of the spectrum compensation model.
  • the model application stage may be performed by the prediction device 120, and may include the following steps:
  • step 404 a first audio signal is acquired.
  • the first audio signal after audio enhancement processing is acquired.
  • the first audio signal may be a time-domain signal with partial distortion.
  • step 405 the first audio signal is converted into a corresponding frequency domain signal.
  • the first audio signal is a time domain signal, and the time domain signal is converted into a corresponding frequency domain signal through a transform for subsequent calculation.
  • before the transform, the first audio signal needs to be framed and windowed, so as to truncate the sampling time and process a finite-length signal.
  • the window function may be a rectangular window, a triangular window, a Hanning window, a Hamming window, a Kaiser window, or the like. Windowing of the first audio signal may use the square root of the Hamming window.
  • the window function corresponding to the square root of the Hamming window can be as follows: w(n) = sqrt(0.54 - 0.46·cos(2πn / (N - 1)))
  • where n takes integer values in [0, N - 1], and N is the window length in sample points corresponding to 20 ms.
  • the windowed first audio signal can be obtained by multiplying the first audio signal by the window function: x_w(n) = x_in(n)·w(n)
  • where x_in consists of the final 10 ms of the previous frame's time domain signal and the 10 ms of the current frame's time domain signal.
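The framing and windowing described above can be sketched as follows; the 16 kHz sampling rate is an assumption used only to fix the sample count N for a 20 ms window:

```python
import numpy as np

SR = 16000                       # assumed sampling rate
N = int(0.020 * SR)              # 20 ms window -> N = 320 sample points
HOP = N // 2                     # 10 ms hop: previous 10 ms + current 10 ms

# Square root of the Hamming window, n in [0, N-1]
n = np.arange(N)
w = np.sqrt(0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1)))

def windowed_frames(x):
    """Split x into 50%-overlapping 20 ms frames and multiply each frame
    x_in (previous 10 ms + current 10 ms) by the window function w."""
    starts = range(0, len(x) - N + 1, HOP)
    return np.array([x[s:s + N] * w for s in starts])

x = np.random.default_rng(1).normal(size=SR)   # 1 s of test signal
x_w = windowed_frames(x)
```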
  • frequency domain conversion is performed on the processed time domain signal to obtain a corresponding frequency domain signal.
  • the manner of converting the time domain signal into the frequency domain signal may include different algorithms.
  • a corresponding frequency-domain signal is obtained by performing discrete Fourier transform (Discrete Fourier Transformation, DFT) on the processed time-domain signal.
  • the amplitude X(i, k) of each frequency point in the frequency domain signal can be obtained by discrete Fourier transform.
  • a corresponding frequency-domain signal is obtained by performing a modified discrete cosine transform on the processed time-domain signal.
  • the modified discrete cosine transform (Modified Discrete Cosine Transform, MDCT) is a transform related to the Fourier transform, based on the fourth type discrete cosine transform (DCT-IV).
  • the modified discrete cosine transform is similar to the discrete Fourier transform, but only uses real numbers.
  • the calculation method is similar to the algorithm of discrete Fourier transform.
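For the DFT branch, the conversion of a windowed frame into frequency point amplitudes can be sketched with numpy's real FFT; the frame length and contents are illustrative:

```python
import numpy as np

N = 320                                  # assumed 20 ms frame at 16 kHz
frame = np.random.default_rng(2).normal(size=N)   # stands in for x_w

# DFT of the real-valued frame; rfft returns the non-redundant half of
# the spectrum, i.e. N//2 + 1 frequency bins.
X = np.fft.rfft(frame)
amplitude = np.abs(X)                    # amplitude of each frequency point
power_spectrum = amplitude ** 2          # S(i, k), squared amplitude

# Parseval's relation ties the two domains together: for even N,
# sum(frame**2) == (S[0] + S[-1] + 2 * sum(S[1:-1])) / N
```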
  • step 406 the frequency domain signal is divided into at least one subband frequency domain signal.
  • the entire frequency domain signal corresponding to the first audio signal is divided into at least one subband frequency domain signal.
  • dividing the frequency domain signal into at least one subband may be performed during a linear frequency domain transform or during a nonlinear frequency domain transform.
  • when a linear frequency domain transform is performed, the frequency domain signal can be divided into at least one subband frequency domain signal by equal division.
  • when a nonlinear frequency domain transform is performed, the Bark domain can be used as a scale to divide the frequency domain signal into at least one subband frequency domain signal.
  • the Bark domain consists of the 24 critical frequency bands simulated by the auditory filters.
  • the Bark domain can be used to describe the signal.
  • when the frequency domain signal is divided into at least one subband frequency domain signal through the Bark domain, the subband frequency domain signals are uneven.
  • the division into Bark-domain subbands may be performed on the basis of first dividing the frequency domain signal into equally divided subbands.
  • the serial number corresponding to each frequency point is obtained.
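Both divisions can be sketched as follows; the subband count and the geometric spacing used to imitate the uneven Bark-style bands are illustrative assumptions (the true Bark scale has 24 fixed critical bands):

```python
import numpy as np

N_BINS = 161                    # rfft bins of a 320-sample frame (assumed)
N_SUBBANDS = 20                 # illustrative subband count

# Linear division: equal-width subbands of frequency point serial numbers.
uniform = np.array_split(np.arange(N_BINS), N_SUBBANDS)

# Bark-style division: band width grows with frequency, so subbands are
# uneven; geometric spacing stands in for the true critical-band edges.
edges = np.unique(np.geomspace(1, N_BINS, N_SUBBANDS + 1).astype(int))
bark = [np.arange(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
```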
  • step 407 frequency bin amplitudes in at least one subband frequency domain signal are determined.
  • the amplitude of each frequency point included in each subband frequency signal is determined.
  • the frequency point sequence number included in each subband frequency signal is determined according to the division of at least one subband frequency signal.
  • the corresponding amplitude of each frequency point can be determined according to the frequency point serial number.
  • the frequency point amplitude can be obtained by calculation through the formula shown in step 402 .
  • the uniformly divided subband is converted into subband division based on the scale of the Bark domain, and the amplitude of the frequency point of the 0th subband can be
  • step 408 the power spectrum value of the frequency domain signal of at least one subband is determined according to the frequency bin amplitude.
  • the power spectrum value corresponding to each subband frequency domain signal is determined by acquiring the amplitude value of each frequency point included in each subband frequency domain signal.
  • the amplitude of each frequency point obtained after the Fourier transform can be further calculated as the square of the amplitude of each frequency point, that is, the corresponding power spectrum value.
  • the calculation formula for obtaining the corresponding power spectrum value from the amplitude of each frequency point can be as follows: S(i, k) = |X(i, k)|²
  • where i is the corresponding frame number and k is the corresponding frequency point index value.
  • the amplitude of each frequency point obtained after the modified discrete cosine transform can likewise be squared to obtain the corresponding power spectrum value.
  • step 409 the power spectrum value of at least one subband frequency domain signal is input into the spectrum compensation model, and the subband prediction results corresponding to each subband frequency domain signal respectively are obtained.
  • the power spectrum value corresponding to each subband frequency domain signal is input into the spectrum compensation model, and the predicted power spectrum value corresponding to each subband frequency domain signal is obtained as the subband prediction result of that subband.
  • a prediction result of performing prediction compensation on the distortion spectrum in the first audio signal is determined according to the subband prediction result.
  • logarithmic processing is performed on the power spectrum values corresponding to each subband frequency domain signal, the logarithmic values are input into the spectrum compensation model, and the logarithmic values of the predicted power spectrum are obtained as the prediction result.
  • step 410 according to the prediction result, the reconstructed power spectrum value of the first audio signal is obtained.
  • the reconstructed power spectrum value corresponding to each subband frequency domain signal is obtained according to the prediction result corresponding to each subband frequency domain signal output by the spectrum compensation model.
  • the reconstructed power spectrum value may be a logarithmic value.
  • the reconstructed power spectrum value of the first audio signal may be directly obtained according to the prediction result.
  • the predicted power spectrum value corresponding to at least one subband frequency domain signal is directly used as the reconstructed power spectrum value.
  • that is, the logarithmic value of the predicted power spectrum can directly serve as the reconstructed power spectrum value.
  • the reconstructed power spectrum value is determined according to the prediction result and the power spectrum value corresponding to each input subband frequency domain signal.
  • the sum of the power spectrum value corresponding to the first audio signal and the frequency band impairment rate may be used as the reconstructed power spectrum value.
  • the frequency band impairment rate is a historical smoothed value of the difference between the predicted power spectrum value corresponding to the frequency domain signal of at least one subband and the power spectrum value corresponding to the first audio signal.
  • in the logarithmic domain, the difference between the predicted logarithmic power spectrum value and the actual logarithmic power spectrum value for each frequency point is: d(i, k) = S_dB_pred(i, k) - S_dB(i, k)
  • the historical smoothed value can be calculated with the following formula: D(i, k) = α·D(i-1, k) + (1 - α)·d(i, k)
  • where α is a parameter whose value range is (0, 1).
  • the smoothed difference between the predicted and actual logarithmic power spectrum values corresponding to a frequency point is used as the gain of that frequency point to determine the reconstructed frequency point amplitude.
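The historical smoothing of the per-frequency-point difference can be sketched as exponential smoothing; the concrete value of the parameter (written ALPHA here) is an assumption within the stated range (0, 1):

```python
import numpy as np

ALPHA = 0.9   # smoothing parameter in (0, 1); exact value is an assumption

def update_impairment_rate(prev_rate, predicted_db, actual_db):
    """Exponentially smooth the difference between the predicted and the
    actual logarithmic power spectrum values of each frequency point."""
    diff = predicted_db - actual_db
    return ALPHA * prev_rate + (1 - ALPHA) * diff

# With a constant 6 dB gap, the smoothed rate converges toward 6.
rate = np.zeros(4)
for _ in range(200):
    rate = update_impairment_rate(rate, np.full(4, 6.0), np.zeros(4))
```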
  • step 411 a target audio signal corresponding to the reconstructed power spectrum value is generated.
  • the generated logarithm of the reconstructed power spectrum value is converted into a linear value, the frequency point amplitude corresponding to the power spectrum value is determined, and the corresponding time domain signal is generated according to the frequency point amplitude as a target audio signal.
  • in one case, the prediction results corresponding to the subband frequency domain signals are used directly as the reconstructed power spectrum values; each is converted into a linear value, and the square root of the linear value is taken to determine the amplitude of each frequency point.
  • in another case, the power spectrum value corresponding to each subband frequency domain signal is added to the frequency band impairment rate to obtain the reconstructed power spectrum value.
  • the reconstructed power spectrum value is converted into a linear value, and the square root of the linear value is taken to determine the amplitude of each frequency point.
  • the formula for calculating the reconstructed power spectrum value can be as follows: S'_dB(i, k) = S_dB(i, k) + D(i, k), where D(i, k) is the frequency band impairment rate
  • the calculation formula for converting the logarithmic value of the power spectrum to a linear value can be as follows: S(i, k) = 10^(S'_dB(i, k) / 10)
  • the amplitude of each frequency point in the discrete Fourier transform domain can then be obtained as |X(i, k)| = sqrt(S(i, k)).
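The log-to-linear and amplitude conversions can be combined in a short helper; the function name is illustrative, and the formulas are the standard inverses of the logarithmic power spectrum, consistent with obtaining the power spectrum as the squared amplitude:

```python
import numpy as np

def db_to_amplitude(s_db):
    """Convert a reconstructed logarithmic power spectrum value back to a
    frequency point amplitude: S = 10**(S_dB / 10), amplitude = sqrt(S)."""
    s_linear = 10.0 ** (np.asarray(s_db, dtype=float) / 10.0)
    return np.sqrt(s_linear)

amp = db_to_amplitude([0.0, 20.0])   # 0 dB -> amplitude 1, 20 dB -> 10
```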
  • the time domain transform is performed on the frequency domain signal corresponding to the reconstructed power spectrum value to obtain the target audio signal.
  • the conversion from the frequency domain signal to the time domain signal may be performed by inverse discrete Fourier transform or improved inverse discrete cosine transform.
  • x_out can be 20 ms of data.
  • the output of the current frame is obtained by adding the second 10 ms of the 20 ms of data converted from the previous frame's frequency domain signal to the first 10 ms of the current frame's 20 ms of data; the result is used as the target audio signal output for the current frame.
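The 10 ms overlap-add described above can be sketched as follows (the frame length is illustrative; at 16 kHz a 20 ms frame would hold 320 samples):

```python
def overlap_add(prev_frame, cur_frame):
    # prev_frame, cur_frame: consecutive 20 ms time-domain frames
    half = len(cur_frame) // 2
    # second 10 ms of the previous frame + first 10 ms of the current frame
    return [p + c for p, c in zip(prev_frame[half:], cur_frame[:half])]
```

Only the overlapping half of each frame is emitted per call, which smooths the seams between consecutively reconstructed frames.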
  • the prediction device can complete the compensation and reconstruction of the first audio signal through the spectrum compensation model to obtain the target audio signal.
  • the neural network model obtained through model training can uniformly make compensation predictions for different equipment models and software versions, which solves the problem of limited application scenarios for speech signal repair, thereby improving the generality of speech signal repair.
  • FIG. 5 is a structural diagram of an audio signal processing method according to an exemplary embodiment, and the audio signal processing method can be executed by an audio processing device.
  • the above audio processing device may be the model training device 110 and the prediction device 120 in the system shown in FIG. 1 .
  • the audio signal processing method may include two stages, namely a model training stage and a model application stage.
  • the model training phase may be performed offline by the model training device, and may include the following step 51:
  • step 51 taking the spectrally damaged audio as the input and the corresponding original audio as the output target, offline training is performed on the deep neural network model, and the trained neural network model is updated.
  • the spectrum-impaired audio and the corresponding original audio may be sample data pre-stored in the sample set.
  • the model training device uses the spectrum-impaired audio as the input and the corresponding original audio as the output target to train and update the spectrum compensation model, which may be a recurrent neural network (RNN) model or a long short-term memory (LSTM) network model.
  • the model application stage may be performed online by the prediction device, and the model application stage may include the following steps 52 to 59:
  • step 52 the prediction device collects the external audio signal through the audio collection function.
  • the module responsible for audio collection in the audio processing device may, for example, collect external audio through a microphone module to obtain audio signals.
  • step 53 the prediction device generates the audio signal after the speech enhancement process by subjecting the acquired audio signal to the software and hardware speech enhancement processing inherent in the prediction device.
  • the inherent software and hardware speech enhancement processing may include applying different speech enhancement algorithms to process the audio signal, and the generated audio signal after speech enhancement processing may have different degrees of signal damage.
  • step 54 the audio signal after the speech enhancement processing is a time-domain signal, and the time-domain signal is transformed into a frequency-domain signal through an operation.
  • the operation method for transforming the time domain signal into the frequency domain signal may be discrete Fourier transform or discrete cosine transform.
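Steps 54 and 55 together amount to computing a log power spectrum per frame. A minimal sketch using the discrete Fourier transform option (the small epsilon guarding against the logarithm of zero is an implementation detail, not from the patent):

```python
import numpy as np

def frame_log_power_spectrum(frame):
    spec = np.fft.rfft(frame)          # discrete Fourier transform of the frame
    power = np.abs(spec) ** 2          # power at each frequency point
    return np.log10(power + 1e-12)     # logarithm of the power spectrum
```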
  • step 55 the frequency domain signal is input into a neural network model for prediction.
  • the samples at the input end and the target output end are logarithms of the power spectrum, so it is necessary to calculate the logarithmic value of the power spectrum corresponding to the frequency domain signal and input it into the spectrum compensation model for prediction.
  • in step 56, the prediction result output by the neural network model is the predicted audio signal, i.e., the predicted repaired speech-enhancement-distorted signal; subband damage rate analysis is performed on the obtained predicted audio signal and the previously input frequency domain signal to obtain the subband frequency domain signal impairment rate of the current signal.
  • the prediction result output by the spectrum compensation model is the logarithmic value of the power spectrum corresponding to the predicted audio signal, and the subband damage rate can be obtained by analyzing this logarithmic value together with the logarithmic value of the power spectrum corresponding to the input frequency domain signal; the damage rate is the historical smoothed value of the difference between the two logarithmic values.
  • this step can be omitted, that is, the directly obtained predicted audio signal is used as a basis to perform the next step.
  • step 57 the damaged frequency band is reconstructed to generate a reconstructed frequency domain signal.
  • when an input subband frequency domain signal is a damaged frequency band, the logarithmic value of the power spectrum of the reconstructed frequency domain signal is calculated and determined by means of the subband damage rate; or, when step 56 is omitted, the logarithm of the power spectrum of the directly obtained predicted audio signal is taken as the logarithm of the power spectrum of the reconstructed frequency domain signal.
  • step 58 the reconstructed frequency domain signal is converted into a time domain signal.
  • if in step 54 the operation for transforming the time-domain signal into the frequency-domain signal is the discrete Fourier transform, the reconstructed frequency-domain signal may be converted into a time-domain signal using the inverse discrete Fourier transform; if in step 54 it is the discrete cosine transform, the conversion may use the inverse discrete cosine transform.
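The forward/inverse pairing in steps 54 and 58 can be checked with a quick round trip through the discrete Fourier transform (the frame length and test signal are chosen arbitrarily):

```python
import numpy as np

frame = np.sin(np.arange(320) * 0.1)          # a 20 ms frame at 16 kHz (illustrative)
spec = np.fft.rfft(frame)                     # step 54: time domain -> frequency domain
recon = np.fft.irfft(spec, n=len(frame))      # step 58: frequency domain -> time domain
```

A matching transform pair reconstructs the frame exactly (up to floating-point error), which is why the inverse in step 58 must match the forward choice made in step 54.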
  • step 59 the time domain signal is output as the repaired target audio.
  • the time-domain signal converted from the reconstructed frequency-domain signal is output as the target audio signal, which may be played by the audio playback module of the prediction device, for example through its speaker module.
  • the neural network model obtained through model training can uniformly make compensation predictions for different equipment models and software versions, which solves the problem of limited application scenarios for speech signal repair, thereby improving the generality of speech signal repair.
  • FIG. 6 is a schematic diagram of a voice restoration applied to a voice call system according to an exemplary embodiment.
  • the voice call system may include a voice transmitter 620, a voice receiver 640, and a server 630 for data transmission.
  • the process of voice restoration in the voice call system may be as follows:
  • the voice transmitter 620 collects voice signals from the outside world and performs its inherent software and hardware voice enhancement processing on them; the voice signal, which carries certain damage after the enhancement processing, is repaired, and the repaired enhanced voice signal is sent through the server end 630 to the voice receiving end 640, which plays it through its own voice playback module.
  • the spectrum compensation model in the voice transmitter 620 is obtained through offline model training by the model training device 610, and the voice transmitter 620 can download, install and update the trained spectrum compensation model in the application.
  • FIG. 7 is a schematic diagram of another voice restoration applied in a voice communication system according to an exemplary embodiment.
  • the voice communication system may include a voice transmitter 720, a voice receiver 740, and a server end 730 for data transmission.
  • the process of voice restoration in the voice communication system may be as follows:
  • the voice transmitting end 720 collects the voice signal from the outside world and performs its inherent software and hardware voice enhancement processing on it, obtaining a voice signal with certain damage after the voice enhancement processing.
  • the server end 730 performs speech restoration processing on the speech signal with certain damage after the speech enhancement processing through the spectrum compensation model in the above-mentioned embodiment, and then sends the repaired enhanced speech signal to the speech receiving end 740.
  • the voice receiving end 740 plays the repaired enhanced voice signal through its own voice playing module.
  • the spectrum compensation model in the server end 730 is obtained through offline model training by the model training device 710, and the server end 730 can acquire the trained spectrum compensation model to perform speech repair processing on the speech signal that carries certain damage after the speech enhancement processing.
  • the neural network model obtained through model training can uniformly make compensation predictions for different equipment models and software versions, which solves the problem of limited application scenarios for speech signal repair, thereby improving the generality of speech signal repair.
  • Fig. 8 is a structural block diagram of an audio signal processing apparatus according to an exemplary embodiment.
  • the audio signal processing apparatus may be implemented in an audio processing device to execute all or part of the steps in the method shown in the corresponding embodiment of FIG. 3 or FIG. 4.
  • the audio signal processing device may include:
  • a signal acquisition module 810 configured to acquire a first audio signal
  • a result obtaining module 820 configured to process the first audio signal through a spectrum compensation model to obtain a prediction result of performing prediction compensation on the distorted spectrum in the first audio signal;
  • the spectrum compensation model is a neural network model obtained by training with spectrally distorted audio samples and the original audio samples corresponding to the spectrally distorted audio samples;
  • the target obtaining module 830 is configured to reconstruct the first audio signal according to the prediction result, and obtain a target audio signal after repairing the distorted spectrum in the first audio signal.
  • the result obtaining module 820 includes:
  • a frequency-domain conversion submodule for converting the first audio signal into a corresponding frequency-domain signal
  • a subband dividing submodule configured to divide the frequency domain signal into at least one subband frequency domain signal
  • an amplitude determination submodule configured to determine the frequency point amplitude in the at least one subband frequency domain signal
  • a power spectrum value determination submodule configured to determine the power spectrum value of the at least one subband frequency domain signal according to the frequency point amplitude
  • the sub-band result obtaining sub-module is configured to input the power spectrum value of the at least one sub-band frequency domain signal into the spectrum compensation model, and obtain the sub-band prediction results corresponding to the at least one sub-band frequency domain signal respectively.
  • a prediction result of performing prediction compensation on the distortion spectrum in the first audio signal is determined according to the subband prediction result.
  • the frequency domain conversion submodule includes:
  • a windowing processing unit configured to perform frame-by-frame windowing processing on the first audio signal, and determine the time-domain signal processed by the first audio signal
  • a signal obtaining unit configured to perform frequency domain conversion on the processed time domain signal to obtain the corresponding frequency domain signal.
  • the signal acquisition unit is configured to:
  • the corresponding frequency-domain signal is obtained by performing an improved discrete cosine transform on the processed time-domain signal.
  • the sub-band division sub-module includes:
  • a subband dividing unit configured to divide the frequency domain signal into the at least one subband frequency domain signal with the Bark domain as the granularity.
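One way the Bark-granularity division could be realized is by grouping DFT frequency points using the standard critical-band edges; the edges below are the published Bark band boundaries, truncated here under an assumed 8 kHz Nyquist for 16 kHz speech:

```python
# Critical-band (Bark) edges in Hz, truncated below 8 kHz
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def bark_subband(freq_hz):
    # index of the Bark subband that contains the given frequency
    for i in range(len(BARK_EDGES_HZ) - 1):
        if freq_hz < BARK_EDGES_HZ[i + 1]:
            return i
    return len(BARK_EDGES_HZ) - 2             # clamp frequencies above the last edge
```

Grouping frequency points this way gives the model perceptually uniform subbands rather than linearly spaced ones.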
  • the sub-band result acquisition sub-module includes:
  • a subband result acquisition unit configured to input the power spectrum value corresponding to the at least one subband frequency domain signal into the spectrum compensation model, and obtain the predicted power spectrum value corresponding to the at least one subband frequency domain signal, as a subband prediction result of the at least one subband frequency domain signal.
  • the target acquisition module 830 includes:
  • a power spectrum value acquisition sub-module configured to acquire the reconstructed power spectrum value of the first audio signal according to the prediction result
  • a target generation sub-module configured to generate the target audio signal corresponding to the reconstructed power spectrum value.
  • the target acquisition module 830 includes:
  • a power spectrum value generation sub-module configured to use the sum of the power spectrum value corresponding to the first audio signal and the frequency band impairment rate as the reconstructed power spectrum value;
  • the frequency band impairment rate is the historical smoothed value of the difference between the predicted power spectrum values respectively corresponding to the frequency domain signals and the power spectrum value corresponding to the first audio signal;
  • the power spectrum value determination submodule is configured to use the predicted power spectrum values corresponding to the at least one subband frequency domain signal respectively as the reconstructed power spectrum value.
  • the target generation submodule includes:
  • a time domain conversion unit configured to perform time domain transformation on the frequency domain signal corresponding to the reconstructed power spectrum value to obtain the target audio signal.
  • the signal acquisition module 810 includes:
  • the signal acquisition sub-module is used for acquiring the first audio signal after audio enhancement processing.
  • the apparatus further includes:
  • a sample acquisition module configured to process the first audio signal through a spectrum compensation model, and acquire the original audio sample before obtaining a prediction result of performing prediction compensation on the distortion spectrum in the first audio signal;
  • a distortion sample acquisition module configured to perform power spectrum value suppression processing on a part of the frequency band in the frequency domain signal corresponding to the original audio sample, to obtain a spectrally distorted audio sample corresponding to the original audio sample;
  • a model acquisition module configured to take the spectrally distorted audio samples as input, and use the original audio samples as training targets to perform machine learning training to obtain the spectral compensation model.
  • the spectral compensation model is RNN or LSTM.
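The distortion sample acquisition described above suppresses the power of part of the frequency band in the original sample's spectrum. A hedged sketch of one way to synthesize such a training pair (the band edges, sample rate, and attenuation amount are invented for illustration):

```python
import numpy as np

def make_distorted_sample(original, sr=16000, lo=2000, hi=4000, atten_db=30.0):
    spec = np.fft.rfft(original)
    freqs = np.fft.rfftfreq(len(original), d=1.0 / sr)
    band = (freqs >= lo) & (freqs < hi)
    spec[band] *= 10.0 ** (-atten_db / 20.0)   # suppress power in the chosen band
    return np.fft.irfft(spec, n=len(original))
```

Each original sample and its band-suppressed copy then form one (input, target) pair for the machine learning training described above.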
  • the neural network model obtained through model training can uniformly make compensation predictions for different equipment models and software versions, which solves the problem of limited application scenarios for speech signal repair, thereby improving the generality of speech signal repair.
  • Fig. 9 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • the computer device may be implemented as an audio processing device.
  • the audio processing device may include the model training device 110 and the prediction device 120 shown in FIG. 1 .
  • the computer device 900 includes a central processing unit (Central Processing Unit, CPU) 901, a system memory 904 including a random access memory (Random Access Memory, RAM) 902 and a read-only memory (Read-Only Memory, ROM) 903, and A system bus 905 that connects the system memory 904 and the central processing unit 901 .
  • the computer device 900 also includes a basic input/output (I/O) system 906 that helps transfer information between the various devices in the computer, and a mass storage device 907 used to store an operating system 913, application programs 914 and other program modules 915.
  • the mass storage device 907 is connected to the central processing unit 901 through a mass storage controller (not shown) connected to the system bus 905 .
  • the mass storage device 907 and its associated computer-readable media provide non-volatile storage for the computer device 900 . That is, the mass storage device 907 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • the computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), CD-ROM, Digital Video Disc (DVD) or other optical storage, cassette, magnetic tape, magnetic disk storage or other magnetic storage device.
  • the computer device 900 can be connected to the Internet or other network devices through a network interface unit 911 connected to the system bus 905 .
  • the memory also includes one or more programs, which are stored in the memory, and the central processing unit 901 implements all or part of the steps of the method shown in FIG. 3 or FIG. 4 by executing the one or more programs.
  • a non-transitory computer-readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to complete all or part of the steps of the methods shown in the embodiments of the present application.
  • the non-transitory computer-readable storage medium may be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), magnetic tapes, floppy disks, and optical data storage devices, etc.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the audio signal processing methods provided in various optional implementations of the above aspects.


Abstract

The invention relates to an audio signal processing method and apparatus, a computer device, a computer-readable storage medium and a computer program product, in the technical field of artificial intelligence. The method comprises: acquiring a first audio signal (301); processing the first audio signal by means of a spectrum compensation model to obtain a prediction result of performing predictive compensation on a distorted spectrum in the first audio signal (302); and reconstructing the first audio signal according to the prediction result to obtain a target audio signal in which the distorted spectrum in the first audio signal has been repaired (303). On the basis of the artificial intelligence (AI) concept, a neural network model obtained through model training can uniformly perform compensation prediction for different device models and software versions, solving the problem of limited application scenarios for speech signal restoration and thereby improving the generality of speech signal restoration.
PCT/CN2021/097663 2020-07-13 2021-06-01 Procédé de traitement de signal audio et appareil associé WO2022012195A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010670626.9A CN112820315B (zh) 2020-07-13 2020-07-13 音频信号处理方法、装置、计算机设备及存储介质
CN202010670626.9 2020-07-13

Publications (1)

Publication Number Publication Date
WO2022012195A1 true WO2022012195A1 (fr) 2022-01-20

Family

ID=75853211

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097663 WO2022012195A1 (fr) 2020-07-13 2021-06-01 Procédé de traitement de signal audio et appareil associé

Country Status (2)

Country Link
CN (1) CN112820315B (fr)
WO (1) WO2022012195A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866856A (zh) * 2022-05-06 2022-08-05 北京达佳互联信息技术有限公司 音频信号的处理方法、音频生成模型的训练方法及装置
CN114974299A (zh) * 2022-08-01 2022-08-30 腾讯科技(深圳)有限公司 语音增强模型的训练、增强方法、装置、设备、介质
CN116248229A (zh) * 2022-12-08 2023-06-09 南京龙垣信息科技有限公司 一种面向实时语音通讯的丢包补偿方法
CN117395181A (zh) * 2023-12-12 2024-01-12 方图智能(深圳)科技集团股份有限公司 基于物联网的低延时多媒体音频传输检测方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820315B (zh) * 2020-07-13 2023-01-06 腾讯科技(深圳)有限公司 音频信号处理方法、装置、计算机设备及存储介质
CN113612808B (zh) * 2021-10-09 2022-01-25 腾讯科技(深圳)有限公司 音频处理方法、相关设备、存储介质及程序产品
CN114822567B (zh) * 2022-06-22 2022-09-27 天津大学 一种基于能量算子的病理嗓音频谱重构方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512938A (zh) * 2006-08-01 2009-08-19 Dts(英属维尔京群岛)有限公司 用于补偿音频变换器的线性和非-线性失真的神经网络滤波技术
CN107112025A (zh) * 2014-09-12 2017-08-29 美商楼氏电子有限公司 用于恢复语音分量的系统和方法
CN107895571A (zh) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 无损音频文件识别方法及装置
CN109147805A (zh) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 基于深度学习的音频音质增强
CN109686381A (zh) * 2017-10-19 2019-04-26 恩智浦有限公司 用于信号增强的信号处理器和相关方法
US20200090676A1 (en) * 2018-09-17 2020-03-19 Honeywell International Inc. System and method for audio noise reduction
CN112820315A (zh) * 2020-07-13 2021-05-18 腾讯科技(深圳)有限公司 音频信号处理方法、装置、计算机设备及存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512938A (zh) * 2006-08-01 2009-08-19 Dts(英属维尔京群岛)有限公司 用于补偿音频变换器的线性和非-线性失真的神经网络滤波技术
CN107112025A (zh) * 2014-09-12 2017-08-29 美商楼氏电子有限公司 用于恢复语音分量的系统和方法
CN107895571A (zh) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 无损音频文件识别方法及装置
CN109686381A (zh) * 2017-10-19 2019-04-26 恩智浦有限公司 用于信号增强的信号处理器和相关方法
CN109147805A (zh) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 基于深度学习的音频音质增强
US20200090676A1 (en) * 2018-09-17 2020-03-19 Honeywell International Inc. System and method for audio noise reduction
CN112820315A (zh) * 2020-07-13 2021-05-18 腾讯科技(深圳)有限公司 音频信号处理方法、装置、计算机设备及存储介质

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866856A (zh) * 2022-05-06 2022-08-05 北京达佳互联信息技术有限公司 音频信号的处理方法、音频生成模型的训练方法及装置
CN114866856B (zh) * 2022-05-06 2024-01-02 北京达佳互联信息技术有限公司 音频信号的处理方法、音频生成模型的训练方法及装置
CN114974299A (zh) * 2022-08-01 2022-08-30 腾讯科技(深圳)有限公司 语音增强模型的训练、增强方法、装置、设备、介质
CN114974299B (zh) * 2022-08-01 2022-10-21 腾讯科技(深圳)有限公司 语音增强模型的训练、增强方法、装置、设备、介质
WO2024027295A1 (fr) * 2022-08-01 2024-02-08 腾讯科技(深圳)有限公司 Procédé et appareil de formation de modèle d'amélioration de la parole, procédé d'amélioration, dispositif électronique, support de stockage et produit programme
CN116248229A (zh) * 2022-12-08 2023-06-09 南京龙垣信息科技有限公司 一种面向实时语音通讯的丢包补偿方法
CN116248229B (zh) * 2022-12-08 2023-12-01 南京龙垣信息科技有限公司 一种面向实时语音通讯的丢包补偿方法
CN117395181A (zh) * 2023-12-12 2024-01-12 方图智能(深圳)科技集团股份有限公司 基于物联网的低延时多媒体音频传输检测方法及系统
CN117395181B (zh) * 2023-12-12 2024-02-13 方图智能(深圳)科技集团股份有限公司 基于物联网的低延时多媒体音频传输检测方法及系统

Also Published As

Publication number Publication date
CN112820315A (zh) 2021-05-18
CN112820315B (zh) 2023-01-06

Similar Documents

Publication Publication Date Title
WO2022012195A1 (fr) Procédé de traitement de signal audio et appareil associé
KR101266894B1 (ko) 특성 추출을 사용하여 음성 향상을 위한 오디오 신호를 프로세싱하기 위한 장치 및 방법
JP6903611B2 (ja) 信号生成装置、信号生成システム、信号生成方法およびプログラム
US8484020B2 (en) Determining an upperband signal from a narrowband signal
KR20230013054A (ko) 심층 신경망을 사용하는 시변 및 비선형 오디오 처리
US10141008B1 (en) Real-time voice masking in a computer network
CN113345460B (zh) 音频信号处理方法、装置、设备及存储介质
WO2022166710A1 (fr) Appareil et procédé d'amélioration de la parole, dispositif et support de stockage
CN112712816A (zh) 语音处理模型的训练方法和装置以及语音处理方法和装置
CN114333893A (zh) 一种语音处理方法、装置、电子设备和可读介质
JP2000069597A (ja) インパルス応答測定方法
CN109741761B (zh) 声音处理方法和装置
Hammam et al. Blind signal separation with noise reduction for efficient speaker identification
CN110875037A (zh) 语音数据处理方法、装置及电子设备
CN112233693B (zh) 一种音质评估方法、装置和设备
WO2022166738A1 (fr) Procédé et appareil d'amélioration de parole, dispositif et support de stockage
Schröter et al. CLC: complex linear coding for the DNS 2020 challenge
CN113593604A (zh) 检测音频质量方法、装置及存储介质
CN114333892A (zh) 一种语音处理方法、装置、电子设备和可读介质
CN111326166B (zh) 语音处理方法及装置、计算机可读存储介质、电子设备
CN114333891A (zh) 一种语音处理方法、装置、电子设备和可读介质
Campbell et al. Feature extraction of automatic speaker recognition, analysis and evaluation in real environment
Lajmi An improved packet loss recovery of audio signals based on frequency tracking
JP6827908B2 (ja) 音源強調装置、音源強調学習装置、音源強調方法、プログラム
CN113436644B (zh) 音质评估方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21843222

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21843222

Country of ref document: EP

Kind code of ref document: A1