WO2023240887A1 - Dereverberation method, apparatus, device, and storage medium - Google Patents

Dereverberation method, apparatus, device, and storage medium - Download PDF

Info

Publication number
WO2023240887A1
WO2023240887A1 (PCT/CN2022/128051; CN2022128051W)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
frequency domain
reverberation
domain signal
speech
Prior art date
Application number
PCT/CN2022/128051
Other languages
English (en)
French (fr)
Inventor
刘建国 (LIU Jianguo)
郝斌 (HAO Bin)
Original Assignee
青岛海尔科技有限公司 (Qingdao Haier Technology Co., Ltd.)
海尔智家股份有限公司 (Haier Smart Home Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co., Ltd. (青岛海尔科技有限公司) and Haier Smart Home Co., Ltd. (海尔智家股份有限公司)
Publication of WO2023240887A1

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 9/00 Arrangements for interconnection not involving centralised switching
    • H04M 9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M 9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic, using echo cancellers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering, the noise being echo or reverberation of the speech

Definitions

  • The present application relates to the technical field of audio signal processing, and in particular to a dereverberation method, apparatus, device, and storage medium.
  • Existing reverberation-suppression methods based on deep neural networks, such as the fully convolutional time-domain audio separation network Conv-TasNet, use the model output directly as the final result to obtain dereverberated speech.
  • However, when the model output is used directly as the final result, the output speech is considerably distorted, which hinders subsequent speech recognition and degrades voice wake-up.
  • This application provides a dereverberation method, apparatus, device, and storage medium to solve the problem that using the model output directly as the final result hinders subsequent speech recognition.
  • In a first aspect, this application provides a dereverberation method, including:
  • obtaining a speech frequency-domain signal to be processed, where the speech frequency-domain signal contains a reverberation frequency-domain signal, and determining the corresponding speech frequency-domain feature signal from it, where the feature signal likewise contains the reverberation frequency-domain signal;
  • inputting the speech frequency-domain feature signal into a preset neural network model and outputting a reverberation-suppressed speech frequency-domain feature signal;
  • determining the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
  • filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • In a second aspect, this application provides a dereverberation apparatus, including:
  • a determination unit, configured to obtain a speech frequency-domain signal to be processed, where the signal contains a reverberation frequency-domain signal, and to determine the corresponding speech frequency-domain feature signal from it, where the feature signal likewise contains the reverberation frequency-domain signal;
  • a processing unit, configured to input the speech frequency-domain feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal;
  • the determination unit being further configured to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
  • a filtering unit, configured to filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • In a third aspect, this application provides an electronic device, including a processor and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method described in the first aspect.
  • In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method described in the first aspect.
  • The dereverberation method, apparatus, device, and storage medium provided by the present application work as follows: obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal, which likewise contains the reverberation frequency-domain signal; input the feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal; determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal; and filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • Figure 1 is a schematic diagram of an application scenario of the dereverberation method provided by this application.
  • Figure 2 is a schematic flowchart of the dereverberation method provided in Embodiment 1 of the present application.
  • Figure 3 is a schematic flowchart of the dereverberation method provided in Embodiment 2 of the present application.
  • Figure 4 is a schematic flowchart of the dereverberation method provided in Embodiment 3 of the present application.
  • Figure 5 is a schematic flowchart of the dereverberation method provided in Embodiment 4 of the present application.
  • Figure 6 is a schematic flowchart of the dereverberation method provided in Embodiment 7 of the present application.
  • Figure 7 is a schematic structural diagram of a dereverberation apparatus provided by an embodiment of the present application.
  • Figure 8 is a first block diagram of an electronic device used to implement the dereverberation method according to an embodiment of the present application.
  • Figure 9 is a second block diagram of an electronic device used to implement the dereverberation method according to an embodiment of the present application.
  • Conv-TasNet is an end-to-end deep-learning framework for time-domain speech separation. It uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers; speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output, and the model output is used as the final result to obtain dereverberated speech.
  • However, when the model output is used directly as the final result, the output speech is considerably distorted, which hinders subsequent speech recognition and degrades voice wake-up.
  • The inventors found in their research that a neural network can be combined with filtering: obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal from it, which likewise contains the reverberation frequency-domain signal; input the feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal; determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal; and filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • Combining the neural network model with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
  • The user speaks, and the incoming voice mixes with the reverberation time-domain signal to form a speech time-domain signal.
  • The smart speaker 1 obtains the speech time-domain signal containing the reverberation time-domain signal from its microphone; samples the input signal according to a preset sampling strategy to obtain the speech time-domain signal to be processed; and applies a Fourier transform to obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal.
  • The smart speaker 1 then determines the corresponding speech frequency-domain feature signal from this signal; inputs the feature signal into the preset neural network model and outputs the reverberation-suppressed speech frequency-domain feature signal; determines the corresponding estimated reverberation frequency-domain signal from that output; filters the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain the dereverberated speech frequency-domain signal; converts it into a dereverberated speech time-domain signal; and performs speech recognition on it.
  • Combining the neural network model with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
  • Figure 2 is a schematic flowchart of the dereverberation method provided in Embodiment 1 of the present application.
  • The method is executed by a dereverberation apparatus located in an electronic device, and includes the following steps.
  • Step 101: Obtain the speech frequency-domain signal to be processed, where the signal contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal from it, where the feature signal likewise contains the reverberation frequency-domain signal.
  • In this embodiment, the speech frequency-domain signal to be processed is obtained, and feature extraction is performed on it to obtain the corresponding speech frequency-domain feature signal, which contains the reverberation frequency-domain signal.
  • Candidate features include Bark-domain features, MFCC, and Fbank.
  • Step 102: Input the speech frequency-domain feature signal into the preset neural network model and output the reverberation-suppressed speech frequency-domain feature signal.
  • The preset neural network model contains a one-dimensional convolution layer, an LSTM layer, a linear layer, and an activation layer; the feature signal is input into the model, and the reverberation-suppressed feature signal is output.
  • The convolution layer uses convolution kernels for feature extraction and feature mapping.
  • The LSTM (long short-term memory) layer is a variant of the simple RNN layer that adds a mechanism for carrying information across many time steps.
  • The linear layer is also called a fully connected layer: all of its neurons are connected by weights, and fully connected layers usually sit at the tail of a convolutional neural network. Once the preceding convolution layers have captured enough features, the final feature volume is typically flattened into a long vector and fed to the fully connected layer, which works with the output layer to produce the final prediction.
  • The activation layer applies an activation function that performs a further nonlinear transformation of the features, which is what gives a multi-layer neural network its depth.
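  • As a concrete illustration of these four layers, the following PyTorch sketch assembles such a network. The layer sizes follow the configuration given in Embodiment 8 below (a 64-to-128-channel convolution with kernel 4, an LSTM with hidden size 64, a 64-to-64 linear layer, and a sigmoid); the class name, the (batch, frames, features) tensor layout, and the dummy input are assumptions made for this example, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DereverbNet(nn.Module):
    """Conv1d -> LSTM -> Linear -> Sigmoid, the four layers named above.

    Sizes follow the configuration in Embodiment 8; the class name and
    the (batch, frames, features) layout are assumptions for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=64, out_channels=128,
                              kernel_size=4, stride=1, padding=1)
        self.lstm = nn.LSTM(input_size=128, hidden_size=64,
                            num_layers=1, batch_first=True)
        self.linear = nn.Linear(64, 64)
        self.act = nn.Sigmoid()  # maps features into (0, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, 64) frames of 64-dim feature vectors
        x = self.conv(feats.transpose(1, 2))  # (batch, 128, frames - 1)
        x, _ = self.lstm(x.transpose(1, 2))   # (batch, frames - 1, 64)
        return self.act(self.linear(x))       # reverberation-suppressed features

model = DereverbNet()
out = model(torch.randn(1, 100, 64))
print(out.shape)  # torch.Size([1, 99, 64])
```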
  • Step 103: Determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal.
  • The reverberation-suppressed feature signal is processed to obtain a gain speech frequency-domain feature signal, and the estimated reverberation frequency-domain signal, i.e., the estimated reverberation component, is determined from the reverberation-suppressed feature signal and the corresponding gain feature signal.
  • Step 104: Filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • The speech frequency-domain signal containing the reverberation frequency-domain signal is filtered based on the estimated reverberation frequency-domain signal; normalized least-mean-square filtering can be used, yielding the dereverberated speech frequency-domain signal.
  • In summary: the speech frequency-domain signal to be processed is obtained; the corresponding feature signal is determined from it; the feature signal is input into the preset neural network model, which outputs the reverberation-suppressed feature signal; the estimated reverberation frequency-domain signal is determined from that output; and the signal to be processed is filtered based on the estimate to obtain the dereverberated speech frequency-domain signal.
  • The reverberation component is thus estimated from the neural network model's output, and filtering is performed based on that component. Combining the neural network model with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
  • Figure 3 is a schematic flowchart of the dereverberation method provided in Embodiment 2 of the present application. As shown in Figure 3, on the basis of Embodiment 1, step 103 is further refined into the following steps.
  • Step 1031: Obtain the corresponding gain speech frequency-domain feature signal from the reverberation-suppressed speech frequency-domain feature signal.
  • The reverberation-suppressed feature signal is converted, mainly by mapping the 64-dimensional reverberation-suppressed feature signal to a 257-dimensional one; the converted signal is the corresponding gain speech frequency-domain feature signal.
  • Step 1032: Determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal and the corresponding gain feature signal.
  • The estimated reverberation frequency-domain signal, which can be regarded as the estimated reverberation component, is computed from these two signals. The speech frequency-domain signal to be processed is then filtered based on this estimate, removing its reverberation component and yielding a speech frequency-domain signal free of reverberation.
  • In this embodiment, the estimated reverberation frequency-domain signal is determined from the network's reverberation-suppressed feature signal and the gain feature signal, so the estimated reverberation component is obtained directly.
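  • The patent does not spell out how the 64-dimensional signal becomes 257-dimensional. One common scheme, shown below purely as an assumption, is to copy each band value to every STFT bin belonging to that band (257 bins corresponding to a 512-point FFT); the uniform band edges in this sketch are illustrative only.

```python
import numpy as np

def expand_band_gains(band_gains: np.ndarray, n_bins: int = 257,
                      n_bands: int = 64) -> np.ndarray:
    """Assumed scheme: give every STFT bin the value of the band it falls in.

    band_gains: 64 per-band values from the network output.
    Returns 257 per-bin values (257 bins = 512-point FFT / 2 + 1).
    """
    edges = np.linspace(0, n_bins, n_bands + 1).astype(int)  # illustrative edges
    gains = np.empty(n_bins)
    for b in range(n_bands):
        gains[edges[b]:edges[b + 1]] = band_gains[b]
    return gains

m_257 = expand_band_gains(np.random.rand(64))
print(m_257.shape)  # (257,)
```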
  • Figure 4 is a schematic flowchart of the dereverberation method provided in Embodiment 3 of the present application. As shown in Figure 4, on the basis of Embodiment 2, step 1032 is further refined into the following steps.
  • Step 1032a: Compute the difference between a preset gain signal and the gain speech frequency-domain feature signal to obtain a first speech frequency-domain signal.
  • The preset gain signal and the gain feature signal are substituted into formula (1):
  • Y = A - M        formula (1)
  • where Y is the first speech frequency-domain signal, A is the preset gain signal, and M is the gain speech frequency-domain feature signal; A takes the value 1.
  • Step 1032b: Multiply the reverberation-suppressed speech frequency-domain feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
  • The two signals are substituted into formula (2):
  • B = N × (A - M)        formula (2)
  • where B is the estimated reverberation frequency-domain signal, N is the reverberation-suppressed speech frequency-domain feature signal, A is the preset gain signal (with value 1), and M is the gain speech frequency-domain feature signal.
  • In this embodiment, the estimated reverberation frequency-domain signal is obtained from the gain part and the reverberation-suppressed feature signal, so the reverberation component is estimated accurately.
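  • In code, formulas (1) and (2) reduce to one element-wise operation, sketched below; the sketch assumes N and M have already been brought to the same shape, since the patent leaves the pairing of the 64- and 257-dimensional signals implicit.

```python
import numpy as np

def estimate_reverberation(n_suppressed: np.ndarray, m_gain: np.ndarray,
                           a_gain: float = 1.0) -> np.ndarray:
    """Formula (1): Y = A - M; formula (2): B = N * (A - M), with A = 1."""
    y_first = a_gain - m_gain        # first speech frequency-domain signal
    return n_suppressed * y_first    # estimated reverberation frequency-domain signal

n = np.random.rand(257)   # stand-in for the reverberation-suppressed signal N
m = np.random.rand(257)   # stand-in for the gain signal M
print(estimate_reverberation(n, m).shape)  # (257,)
```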
  • In Embodiment 4, on the basis of Embodiment 1, step 104 is further refined into the following step.
  • Step 1041: Use the normalized least-mean-square algorithm to determine the calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and determine the dereverberated speech frequency-domain signal from the calibrated reverberation frequency-domain signal and the speech frequency-domain signal.
  • The normalized least-mean-square (NLMS) algorithm is used both to determine the calibrated reverberation frequency-domain signal corresponding to the estimate and to perform the filtering.
  • The NLMS filter order can take values from 3 to 10. The order of a filter refers to the number of filtered harmonics; in general, for the same filter, a higher order gives a better filtering effect but a larger computational cost, so the NLMS filter order can be set to 5 or another suitable value.
  • Because the estimated reverberation frequency-domain signal is only an estimate, it must be calibrated. First, the calibrated reverberation frequency-domain signal corresponding to the estimate is computed by formula (3):
  • y(k) = w(k)^T × x(k)        formula (3)
  • where y is the calibrated reverberation frequency-domain signal, w is the filter coefficient vector, and x is the estimated reverberation frequency-domain signal.
  • The dereverberated speech frequency-domain signal is then determined from the calibrated reverberation frequency-domain signal and the speech frequency-domain signal: both are substituted into formula (4):
  • e(k) = d(k) - y(k)        formula (4)
  • where e is the dereverberated speech frequency-domain signal, d is the speech frequency-domain signal containing the reverberation frequency-domain signal to be processed, and y(k) is the calibrated reverberation frequency-domain signal at frame k.
  • In this embodiment, the NLMS algorithm effectively removes the reverberation component from the speech frequency-domain signal containing the reverberation frequency-domain signal: further calibrating the estimated reverberation component yields a more accurate reverberation component, so a clean dereverberated speech frequency-domain signal is obtained from the calibrated signal, which effectively improves the accuracy of subsequent recognition.
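  • A minimal sketch of this NLMS filtering follows: an order-5 NLMS filter run over the frame sequence, with d as the speech signal containing reverberation and x as the estimated reverberation B, returning e(k) = d(k) - w(k)^T x(k) as in formulas (3) and (4). Applying it per frequency bin, the step size mu, the regularizer eps, and the use of real-valued toy data (actual STFT data is complex) are all assumptions of this example.

```python
import numpy as np

def nlms_dereverb(d: np.ndarray, x: np.ndarray, order: int = 5,
                  mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """NLMS filtering across frames for a single frequency bin.

    d: frames of the speech signal containing reverberation.
    x: frames of the estimated reverberation signal B.
    Returns e with e(k) = d(k) - y(k) and y(k) = w(k)^T x(k),
    matching formulas (3) and (4).
    """
    w = np.zeros(order)
    e = np.zeros_like(d)
    for k in range(len(d)):
        x_vec = x[max(0, k - order + 1):k + 1][::-1]    # newest frame first
        x_vec = np.pad(x_vec, (0, order - len(x_vec)))  # zero-pad at start-up
        y = w @ x_vec                                   # calibrated reverberation
        e[k] = d[k] - y                                 # dereverberated output
        w += mu * e[k] * x_vec / (x_vec @ x_vec + eps)  # normalized update
    return e

d = np.random.randn(200)
b = 0.3 * d + 0.1 * np.random.randn(200)  # toy reverberation estimate
print(nlms_dereverb(d, b)[:5])
```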
  • In Embodiment 5, on the basis of the dereverberation method provided in Embodiment 1, the following step is further included after step 104.
  • Step 105: Convert the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal, and perform speech recognition on the dereverberated speech time-domain signal.
  • The inverse Fourier transform is used to convert the dereverberated speech frequency-domain signal into the dereverberated speech time-domain signal, after which speech recognition is performed and the recognition result is obtained.
  • By removing the reverberation component, speech-recognition accuracy is effectively improved.
  • Figure 5 is a schematic flowchart of the dereverberation method provided in Embodiment 6 of the present application. As shown in Figure 5, on the basis of Embodiment 1, step 102 is further refined into the following steps.
  • Step 1021: Perform Bark-domain feature extraction on the speech frequency-domain signal to obtain the corresponding Bark-domain feature signal.
  • Feature extraction is performed on the speech frequency-domain signal containing the reverberation frequency-domain signal; specifically, Bark-domain feature extraction yields the corresponding Bark-domain feature signal. The Bark domain amplifies low frequencies and compresses high frequencies.
  • Step 1022: Determine the corresponding Bark-domain feature signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
  • Compared with the linear frequency domain, the Bark domain better matches the auditory masking effect of the human ear.
  • Its amplification of low frequencies and compression of high frequencies reveal more clearly which signals are easily masked and which noise is more prominent, improving accuracy.
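  • One plausible implementation of Bark-domain extraction, shown below as a sketch only: STFT magnitudes are pooled into bands spaced evenly on the Bark scale, so low frequencies get many narrow bands and high frequencies few wide ones. Zwicker's Hz-to-Bark approximation is a standard formula, but the 64-band layout and the log-energy pooling are assumptions chosen to match the 64-dimensional feature signals used elsewhere in this document.

```python
import numpy as np

def hz_to_bark(f: np.ndarray) -> np.ndarray:
    # Zwicker's standard approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_features(spectrum: np.ndarray, sr: int = 16000,
                  n_bands: int = 64) -> np.ndarray:
    """Pool a 257-bin magnitude spectrum into Bark-spaced log-energy bands.

    Low frequencies get many narrow bands and high frequencies few wide
    ones, i.e. the amplify-low / compress-high behaviour described above.
    """
    freqs = np.linspace(0, sr / 2, len(spectrum))
    bark = hz_to_bark(freqs)
    edges = np.linspace(0, bark[-1], n_bands + 1)
    band = np.clip(np.digitize(bark, edges) - 1, 0, n_bands - 1)
    feats = np.zeros(n_bands)
    for b in range(n_bands):
        feats[b] = np.log(np.mean(spectrum[band == b] ** 2) + 1e-10)
    return feats

spec = np.abs(np.random.randn(257))
print(bark_features(spec).shape)  # (64,)
```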
  • Figure 6 is a schematic flowchart of the dereverberation method provided in Embodiment 7 of the present application. As shown in Figure 6, on the basis of the methods provided in Embodiments 1 to 6, the following steps are included before step 101.
  • Step 101a: Obtain the speech time-domain signal containing the reverberation time-domain signal input through the microphone.
  • The electronic device may be a smart speaker, in which case the signal is acquired from the smart speaker's microphone.
  • Step 101b: Sample the input speech time-domain signal containing the reverberation time-domain signal according to a preset sampling strategy to obtain the speech time-domain signal to be processed.
  • The preset sampling strategy includes a sampling frequency and a sampling length, for example a sampling frequency of 16 kHz and a sampling length of 512 samples.
  • Step 101c: Apply a Fourier transform to the speech time-domain signal to be processed to obtain the speech frequency-domain signal to be processed.
  • The short-time Fourier transform (STFT) can be used. The STFT is a Fourier-related transform that determines the frequency and phase of the sinusoidal content of a time-varying signal within local time windows; converting the time-domain signal into the frequency domain enables better analysis of the speech signal.
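  • Under the 16 kHz sampling rate and 512-sample frame length given above, each frame yields 512/2 + 1 = 257 frequency bins, which matches the 257-dimensional signals of Embodiment 2. The sketch below shows a conventional STFT and its overlap-add inverse (the inverse transform used in Step 105); the Hann window and 50% hop are assumptions, as the patent does not specify windowing.

```python
import numpy as np

def stft(x: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
    """Short-time Fourier transform: returns (n_frames, 257) complex spectra."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(win * x[k * hop:k * hop + frame])
                     for k in range(n_frames)])

def istft(spectra: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
    """Inverse STFT by overlap-add (the time-domain conversion of Step 105)."""
    win = np.hanning(frame)
    out = np.zeros(hop * (len(spectra) - 1) + frame)
    norm = np.zeros_like(out)
    for k, spec in enumerate(spectra):
        out[k * hop:k * hop + frame] += win * np.fft.irfft(spec)
        norm[k * hop:k * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

x = np.random.randn(16000)  # one second of audio at 16 kHz
spectra = stft(x)
print(spectra.shape)        # (61, 257): 61 frames, 257 bins per frame
y = istft(spectra)
n = len(y)
print(np.allclose(x[512:n - 512], y[512:-512]))  # True away from the edges
```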
  • In Embodiment 8, on the basis of the methods provided in Embodiments 1 to 6, the following steps are included before step 102.
  • Step 102a: Obtain pre-constructed training data, which include multiple speech frequency-domain signals containing reverberation frequency-domain signals and multiple speech frequency-domain signals without reverberation frequency-domain signals.
  • The speech frequency-domain signals without reverberation are obtained by recording, i.e., they are direct sound.
  • The speech frequency-domain signals containing reverberation are obtained by convolving the reverberation-free signals with impulse responses; they are simulated signals generated with an RIR tool by configuring different reverberation times, room sizes, and sound-source and microphone positions.
  • The ratio of the number of samples containing a reverberation frequency-domain signal to the number of samples without one is 8:2.
  • Step 102b: Train the neural network model with the pre-constructed training data to obtain the trained model, and determine the trained model as the preset neural network model.
  • The pre-built neural network contains a one-dimensional convolution layer, an LSTM (long short-term memory) layer, a linear layer, and an activation layer; in the configuration detailed later in the description, the convolution layer has 64 input channels, 128 output channels, a kernel of 4, a stride of 1, and padding of 1, the LSTM has an input size of 128, a hidden size of 64, and one layer, and the linear layer maps 64 features to 64 features.
  • The activation function used in the activation layer is the sigmoid function, an S-shaped function common in biology, also known as the S-shaped growth curve. In information science, because it is monotonically increasing and its inverse is monotonically increasing, the sigmoid is often used as a neural-network activation to map variables into the range 0 to 1.
  • A loss function must also be defined: each training sample is passed through the network to produce a value, and the squared difference between this value and the desired target measures the distance between the predicted and true values; training the network means reducing this distance, i.e., the loss.
  • The neural network model is trained on the pre-constructed data, and the trained model is determined as the preset neural network model. Training makes the model's output better match reality.
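  • A minimal training loop consistent with this description (sigmoid output, squared-error distance as the loss) might look as follows. The architecture repeats the earlier sketch so the block runs standalone; the Adam optimizer, learning rate, batch size, and random stand-in tensors are assumptions. In practice the inputs would be features of the RIR-convolved reverberant speech and the targets features of the corresponding direct sound.

```python
import torch
import torch.nn as nn

class DereverbNet(nn.Module):
    # Same Conv1d/LSTM/Linear/Sigmoid stack as the earlier sketch,
    # repeated here so this block runs standalone.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(64, 128, kernel_size=4, stride=1, padding=1)
        self.lstm = nn.LSTM(128, 64, num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(64, 64), nn.Sigmoid())

    def forward(self, feats):
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.head(x)

model = DereverbNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = nn.MSELoss()  # the squared prediction-target distance described above

for step in range(100):
    reverberant = torch.randn(8, 100, 64)  # stand-in for reverberant features
    clean = torch.rand(8, 99, 64)          # stand-in for direct-sound targets
    optimizer.zero_grad()
    loss = loss_fn(model(reverberant), clean)
    loss.backward()
    optimizer.step()                       # shrink the distance, i.e. the loss
```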
  • Figure 7 is a schematic structural diagram of a dereverberation apparatus provided by an embodiment of the present application.
  • The dereverberation apparatus 200 provided by this embodiment includes a determination unit 201, a processing unit 202, and a filtering unit 203.
  • The determination unit 201 is configured to obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal, and to determine the corresponding speech frequency-domain feature signal from it, which likewise contains the reverberation frequency-domain signal.
  • The processing unit 202 is configured to input the speech frequency-domain feature signal into the preset neural network model and output the reverberation-suppressed speech frequency-domain feature signal.
  • The determination unit 201 is further configured to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal.
  • The filtering unit 203 is configured to filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  • Optionally, the determination unit is further configured to obtain the corresponding gain speech frequency-domain feature signal from the reverberation-suppressed feature signal, and to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal and the corresponding gain feature signal.
  • Optionally, the determination unit is further configured to compute the difference between the preset gain signal and the gain feature signal to obtain the first speech frequency-domain signal, and to multiply the reverberation-suppressed feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
  • Optionally, the filtering unit is further configured to determine, using the normalized least-mean-square algorithm, the calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and to determine the dereverberated speech frequency-domain signal from the calibrated signal and the speech frequency-domain signal.
  • Optionally, the dereverberation apparatus further includes a recognition unit configured to convert the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal and to perform speech recognition on it.
  • Optionally, the determination unit is further configured to perform Bark-domain feature extraction on the speech frequency-domain signal to obtain the corresponding Bark-domain feature signal, and to determine that signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
  • Optionally, the dereverberation apparatus further includes an acquisition unit configured to obtain the speech time-domain signal containing the reverberation time-domain signal input through the microphone; sample it according to the preset sampling strategy to obtain the speech time-domain signal to be processed; and apply a Fourier transform to obtain the speech frequency-domain signal to be processed.
  • Optionally, the processing unit is further configured to obtain pre-constructed training data, including multiple speech frequency-domain signals containing reverberation frequency-domain signals and multiple without them; to train the neural network model with these data; and to determine the trained model as the preset neural network model.
  • Figure 8 is a first block diagram of an electronic device used to implement the dereverberation method according to an embodiment of the present application.
  • The electronic device 300 includes a memory 301 and a processor 302.
  • The memory 301 stores computer-executable instructions.
  • The processor 302 executes the computer-executable instructions stored in the memory 301, causing the processor to perform the method provided by any of the above embodiments.
  • Figure 9 is a second block diagram of an electronic device used to implement the dereverberation method according to an embodiment of the present application.
  • The electronic device may be a computer, a digital broadcast terminal, a messaging device, a tablet device, a personal digital assistant, a server, a server cluster, or the like.
  • The electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording.
  • The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method.
  • The processing component 802 may also include one or more modules that facilitate interaction between the processing component 802 and other components.
  • For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
  • The memory 804 is configured to store various types of data to support operation of the electronic device 800; examples include instructions for any application or method operated on the device, contact data, phonebook data, messages, pictures, and videos.
  • The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • The power component 806 provides power to the various components of the electronic device 800.
  • The power component 806 may include a power-management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
  • The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
  • In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP); if the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the panel. A touch sensor can sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it.
  • In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera.
  • The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operating mode, such as shooting mode or video mode.
  • Each front or rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
  • The audio component 810 is configured to output and/or input audio signals.
  • For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operating mode such as call mode, recording mode, or speech-recognition mode; the received audio signals may be further stored in the memory 804 or sent via the communication component 816.
  • In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
  • The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules such as a keyboard, a click wheel, or buttons; these buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
  • The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800.
  • For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the device's display and keypad.
  • The sensor component 814 can also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the device, the orientation or acceleration/deceleration of the device, and changes in its temperature.
  • The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • In an exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication.
  • For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
  • In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.
  • For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • A computer-readable storage medium is also provided.
  • The computer-readable storage medium stores computer-executable instructions that, when executed by a processor, perform the method of any of the above embodiments.
  • A computer program product is also provided, including a computer program that, when executed by a processor, performs the method of any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A dereverberation method, apparatus, device, and storage medium. The method includes: obtaining a speech frequency-domain signal to be processed and determining a corresponding speech frequency-domain feature signal from the speech frequency-domain signal (101); inputting the speech frequency-domain feature signal into a preset neural network model and outputting a reverberation-suppressed speech frequency-domain feature signal (102); determining a corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal (103); and filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal (104).

Description

Dereverberation method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application No. 2022106644420, filed with the Chinese Patent Office on June 14, 2022 and entitled "Dereverberation method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of audio signal processing, and in particular to a dereverberation method, apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence, speech is no longer only the way people communicate with one another; it has also become an important means of communication between people and machines, and artificial-intelligence speech recognition, as the human-machine interface, has become the key technology for that communication. With the development of intelligent speech-recognition products such as smart speakers and smart TVs, more and more smart products recognize the user's speech through built-in microphones. When a microphone picks up speech indoors, the signal is inevitably disturbed by reflections from the walls, the ceiling, and other obstacles, so the speech signal undergoes linear distortion. This distortion is usually called reverberation: the persistence of sound caused by continued reflections after the source has stopped. It degrades the fidelity and intelligibility of speech and lowers the performance of speech-communication systems and automatic speech-recognition systems.
Existing reverberation-suppression methods based on deep neural networks, such as the fully convolutional time-domain audio separation network Conv-TasNet, use the model output as the final result to obtain dereverberated speech.
However, when the model output is used directly as the final result, the output speech is considerably distorted, which hinders subsequent speech recognition and degrades voice wake-up.
Summary of the Invention
This application provides a dereverberation method, apparatus, device, and storage medium to solve the problem that using the model output directly as the final result hinders subsequent speech recognition.
In a first aspect, this application provides a dereverberation method, including:
obtaining a speech frequency-domain signal to be processed, where the speech frequency-domain signal is a speech frequency-domain signal containing a reverberation frequency-domain signal, and determining a corresponding speech frequency-domain feature signal from the speech frequency-domain signal, where the speech frequency-domain feature signal is a speech frequency-domain feature signal containing a reverberation frequency-domain signal;
inputting the speech frequency-domain feature signal into a preset neural network model and outputting a reverberation-suppressed speech frequency-domain feature signal;
determining a corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
In a second aspect, this application provides a dereverberation apparatus, including:
a determination unit, configured to obtain a speech frequency-domain signal to be processed, where the speech frequency-domain signal contains a reverberation frequency-domain signal, and to determine a corresponding speech frequency-domain feature signal from the speech frequency-domain signal, where the feature signal likewise contains the reverberation frequency-domain signal;
a processing unit, configured to input the speech frequency-domain feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal;
the determination unit being further configured to determine a corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
a filtering unit, configured to filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
In a third aspect, this application provides an electronic device, including a processor and a memory communicatively connected to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method described in the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method described in the first aspect.
The dereverberation method, apparatus, device, and storage medium provided by this application work as follows: obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal, which likewise contains the reverberation frequency-domain signal; input the feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal; determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal; and filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an application scenario of the dereverberation method provided by this application;
Figure 2 is a schematic flowchart of the dereverberation method provided in Embodiment 1 of this application;
Figure 3 is a schematic flowchart of the dereverberation method provided in Embodiment 2 of this application;
Figure 4 is a schematic flowchart of the dereverberation method provided in Embodiment 3 of this application;
Figure 5 is a schematic flowchart of the dereverberation method provided in Embodiment 4 of this application;
Figure 6 is a schematic flowchart of the dereverberation method provided in Embodiment 7 of this application;
Figure 7 is a schematic structural diagram of a dereverberation apparatus provided by an embodiment of this application;
Figure 8 is a first block diagram of an electronic device used to implement the dereverberation method of an embodiment of this application;
Figure 9 is a second block diagram of an electronic device used to implement the dereverberation method of an embodiment of this application.
Specific embodiments of the present application have been shown in the above drawings and are described in more detail below. These drawings and written descriptions are not intended to limit the scope of the concept of the present application in any way, but rather to illustrate the concept to those skilled in the art by reference to specific embodiments.
Detailed Description
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or smart device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or smart device.
To clearly understand the technical solution of this application, the prior-art solution is first introduced in detail.
With the rapid development of artificial intelligence, speech is no longer only the way people communicate with one another; it has also become an important means of communication between people and machines, and artificial-intelligence speech recognition, as the human-machine interface, has become the key technology for that communication. With the development of intelligent speech-recognition products such as smart speakers and smart TVs, more and more smart products recognize the user's speech through built-in microphones. When a microphone picks up speech indoors, the signal is inevitably disturbed by reflections from the walls, the ceiling, and other obstacles, so the speech signal undergoes linear distortion. This distortion is usually called reverberation: the persistence of sound caused by continued reflections after the source has stopped. It degrades the fidelity and intelligibility of speech and lowers the performance of speech-communication systems and automatic speech-recognition systems. Existing reverberation-suppression methods based on deep neural networks include the fully convolutional time-domain audio separation network Conv-TasNet, an end-to-end deep-learning framework for time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers; speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output, and the model output is taken as the final result to obtain dereverberated speech.
However, when the model output is used directly as the final result, the output speech is considerably distorted, which hinders subsequent speech recognition and degrades voice wake-up.
To address the prior-art problem that using the model output directly as the final result hinders subsequent speech recognition, the inventors found in their research that a neural network can be combined with filtering: obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal, which likewise contains the reverberation frequency-domain signal; input the feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal; determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal; and filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal. Using the neural network model together with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
Based on the above creative findings, the inventors arrived at the technical solution of the embodiments of this application. The application scenario of the dereverberation method provided by the embodiments of this application is introduced below.
As shown in Figure 1, the user produces a speech signal, and the incoming voice mixes with the reverberation time-domain signal to form a speech time-domain signal. The smart speaker 1 obtains the speech time-domain signal containing the reverberation time-domain signal from the microphone; samples the input signal according to a preset sampling strategy to obtain the speech time-domain signal containing the reverberation time-domain signal to be processed; and applies a Fourier transform to it to obtain the speech frequency-domain signal to be processed, which contains a reverberation frequency-domain signal. The smart speaker 1 determines the corresponding speech frequency-domain feature signal from this signal; inputs the feature signal into the preset neural network model and outputs the reverberation-suppressed speech frequency-domain feature signal; determines the corresponding estimated reverberation frequency-domain signal from that output; filters the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain the dereverberated speech frequency-domain signal; converts it into a dereverberated speech time-domain signal; and performs speech recognition on it. Using the neural network model together with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Embodiment 1
Figure 2 is a schematic flowchart of the dereverberation method provided in Embodiment 1 of this application. As shown in Figure 2, the method is executed by a dereverberation apparatus located in an electronic device, and includes the following steps.
Step 101: Obtain the speech frequency-domain signal to be processed, where the signal contains a reverberation frequency-domain signal, and determine the corresponding speech frequency-domain feature signal from it, where the feature signal likewise contains the reverberation frequency-domain signal.
In this embodiment, the speech frequency-domain signal to be processed is obtained, and feature extraction is performed on it to obtain the corresponding speech frequency-domain feature signal, which contains the reverberation frequency-domain signal. Candidate features include Bark-domain features, MFCC, Fbank, and the like.
Step 102: Input the speech frequency-domain feature signal into the preset neural network model and output the reverberation-suppressed speech frequency-domain feature signal.
In this embodiment, the preset neural network model contains a one-dimensional convolution layer, an LSTM layer, a linear layer, and an activation layer; the feature signal is input into the model, and the reverberation-suppressed feature signal is output.
The convolution layer uses convolution kernels for feature extraction and feature mapping. The LSTM layer is based on the long short-term memory (LSTM) algorithm; it is a variant of the simple RNN layer that adds a mechanism for carrying information across many time steps. The linear layer is also called a fully connected layer: all of its neurons are connected by weights, and fully connected layers usually sit at the tail of a convolutional neural network. Once the preceding convolution layers have captured enough features for recognition, the remaining task is classification; at the end of the network, the final feature volume is typically flattened into a long vector and fed to the fully connected layer, which works with the output layer to classify. The activation layer applies an activation function that performs a further nonlinear transformation of the features, which is what gives a multi-layer neural network its depth.
Step 103: Determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal.
In this embodiment, the reverberation-suppressed feature signal is processed to obtain a gain speech frequency-domain feature signal, and the estimated reverberation frequency-domain signal, i.e., the estimated reverberation component, is determined from the reverberation-suppressed feature signal and the corresponding gain feature signal.
Step 104: Filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
In this embodiment, the speech frequency-domain signal containing the reverberation frequency-domain signal is filtered based on the estimated reverberation frequency-domain signal; normalized least-mean-square filtering can be used, yielding the dereverberated speech frequency-domain signal.
In this embodiment, the speech frequency-domain signal to be processed is obtained; the corresponding feature signal is determined from it; the feature signal is input into the preset neural network model, which outputs the reverberation-suppressed feature signal; the estimated reverberation frequency-domain signal is determined from that output; and the signal to be processed is filtered based on the estimate to obtain the dereverberated speech frequency-domain signal. The reverberation component is estimated from the model's output, and filtering is performed based on that component. Using the neural network model together with filtering, i.e., filtering the network's output, effectively reduces speech distortion and improves the subsequent voice wake-up rate and recognition rate.
Embodiment 2
Figure 3 is a schematic flowchart of the dereverberation method provided in Embodiment 2 of this application. As shown in Figure 3, on the basis of Embodiment 1, step 103 is further refined into the following steps.
Step 1031: Obtain the corresponding gain speech frequency-domain feature signal from the reverberation-suppressed speech frequency-domain feature signal.
In this embodiment, the reverberation-suppressed feature signal is converted, mainly by mapping the 64-dimensional reverberation-suppressed feature signal to a 257-dimensional one; the converted signal is the corresponding gain speech frequency-domain feature signal.
Step 1032: Determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed feature signal and the corresponding gain feature signal.
In this embodiment, the estimated reverberation frequency-domain signal, which can be regarded as the estimated reverberation component, is computed from the reverberation-suppressed feature signal and the corresponding gain feature signal. The speech frequency-domain signal to be processed is then filtered based on this estimate, removing its reverberation component and yielding a speech frequency-domain signal free of reverberation.
In this embodiment, the estimated reverberation frequency-domain signal is determined from the reverberation-suppressed feature signal output by the neural network model and the gain feature signal, so the estimated reverberation component is obtained directly.
Embodiment 3
Figure 4 is a schematic flowchart of the dereverberation method provided in Embodiment 3 of this application. As shown in Figure 4, on the basis of Embodiment 2, step 1032 is further refined into the following steps.
Step 1032a: Compute the difference between a preset gain signal and the gain speech frequency-domain feature signal to obtain a first speech frequency-domain signal.
In this embodiment, the preset gain signal and the gain feature signal are substituted into formula (1):
Y = A - M        formula (1)
where Y is the first speech frequency-domain signal, A is the preset gain signal, and M is the gain speech frequency-domain feature signal; A takes the value 1.
Step 1032b: Multiply the reverberation-suppressed speech frequency-domain feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
In this embodiment, the reverberation-suppressed feature signal and the first speech frequency-domain signal are substituted into formula (2) to obtain the corresponding estimated reverberation frequency-domain signal:
B = N × (A - M)        formula (2)
where B is the estimated reverberation frequency-domain signal, N is the reverberation-suppressed speech frequency-domain feature signal, A is the preset gain signal (with value 1), and M is the gain speech frequency-domain feature signal.
In this embodiment, the estimated reverberation frequency-domain signal is obtained from the gain part and the reverberation-suppressed feature signal, so the reverberation component is estimated accurately.
Embodiment 4
On the basis of Embodiment 1, step 104 is further refined into the following step.
Step 1041: Use the normalized least-mean-square algorithm to determine the calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and determine the dereverberated speech frequency-domain signal from the calibrated reverberation frequency-domain signal and the speech frequency-domain signal.
In this embodiment, the normalized least-mean-square (NLMS) algorithm is used to determine the calibrated reverberation frequency-domain signal corresponding to the estimate, and the filtering is performed with the NLMS algorithm. The NLMS filter order can take values from 3 to 10. The order of a filter is the number of filtered harmonics; in general, for the same filter, the higher the order, the better the filtering effect, but also the higher the computational cost, so the NLMS filter order can be set to 5 or another suitable value. Because the estimated reverberation frequency-domain signal is only an estimate, it must be calibrated. First, the calibrated reverberation frequency-domain signal corresponding to the estimate is computed by formula (3):
y(k) = w(k)^T × x(k)        formula (3)
where y is the calibrated reverberation frequency-domain signal, w is the filter coefficient vector, and x is the estimated reverberation frequency-domain signal.
Further, the dereverberated speech frequency-domain signal is determined from the calibrated reverberation frequency-domain signal and the speech frequency-domain signal: the calibrated signal and the speech frequency-domain signal containing the reverberation frequency-domain signal to be processed are substituted into formula (4) to obtain the dereverberated signal:
e(k) = d(k) - y(k)        formula (4)
where e is the dereverberated speech frequency-domain signal, d is the speech frequency-domain signal containing the reverberation frequency-domain signal to be processed, and y(k) is the calibrated reverberation frequency-domain signal at frame k.
In this embodiment, the NLMS algorithm effectively removes the reverberation component from the speech frequency-domain signal containing the reverberation frequency-domain signal. By further calibrating the estimated reverberation component, a more accurate reverberation component is obtained, so a clean dereverberated speech frequency-domain signal is derived from the calibrated signal, effectively improving the accuracy of subsequent recognition.
Embodiment 5
On the basis of Embodiment 1, the following step is further included after step 104.
Step 105: Convert the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal, and perform speech recognition on the dereverberated speech time-domain signal.
In this embodiment, the inverse Fourier transform is used to convert the dereverberated speech frequency-domain signal into the dereverberated speech time-domain signal, after which speech recognition is performed on it and the recognition result is obtained.
In this embodiment, removing the reverberation component effectively improves speech-recognition accuracy.
Embodiment 6
Figure 5 is a schematic flowchart of the dereverberation method provided in Embodiment 6 of this application. As shown in Figure 5, on the basis of Embodiment 1, step 102 is further refined into the following steps.
Step 1021: Perform Bark-domain feature extraction on the speech frequency-domain signal to obtain the corresponding Bark-domain feature signal.
In this embodiment, feature extraction is performed on the speech frequency-domain signal containing the reverberation frequency-domain signal; specifically, Bark-domain feature extraction is performed to obtain the corresponding Bark-domain feature signal. The Bark domain amplifies low frequencies and compresses high frequencies.
Step 1022: Determine the corresponding Bark-domain feature signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
In this embodiment, the corresponding Bark-domain feature signal is determined as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
In this embodiment, the Bark domain matches the auditory masking effect of the human ear better than the linear frequency domain. Its amplification of low frequencies and compression of high frequencies reveal more clearly which signals are easily masked and which noise is more prominent, which improves accuracy.
Embodiment 7
Figure 6 is a schematic flowchart of the dereverberation method provided in Embodiment 7 of this application. As shown in Figure 6, on the basis of Embodiments 1 to 6, the following steps are further included before step 101.
Step 101a: Obtain the speech time-domain signal containing the reverberation time-domain signal input through the microphone.
In this embodiment, the electronic device may be a smart speaker, and the speech time-domain signal containing the reverberation time-domain signal input through the smart speaker's microphone is obtained.
Step 101b: Sample the input speech time-domain signal containing the reverberation time-domain signal according to a preset sampling strategy to obtain the speech time-domain signal containing the reverberation time-domain signal to be processed.
In this embodiment, the preset sampling strategy includes a sampling frequency and a sampling length, e.g., a sampling frequency of 16 kHz and a sampling length of 512; the input signal is sampled according to this strategy to obtain the speech time-domain signal to be processed.
Step 101c: Apply a Fourier transform to the speech time-domain signal to be processed to obtain the speech frequency-domain signal to be processed.
In this embodiment, to analyze the signal better, a Fourier transform is applied to the speech time-domain signal containing the reverberation time-domain signal; the short-time Fourier transform (STFT) can be used. The STFT is a Fourier-related transform that determines the frequency and phase of the sinusoidal content of a time-varying signal within local regions. Converting the time-domain signal into a frequency-domain signal enables better analysis of the speech signal.
Embodiment 8
On the basis of Embodiments 1 to 6, the following steps are further included before step 102.
Step 102a: Obtain pre-constructed training data, including multiple speech frequency-domain signals containing reverberation frequency-domain signals and multiple speech frequency-domain signals without reverberation frequency-domain signals.
In this embodiment, pre-constructed training data are obtained. The speech frequency-domain signals without reverberation are obtained by recording, i.e., they are direct sound. The signals containing reverberation are obtained by convolving the reverberation-free signals with impulse responses; they are simulated signals generated with an RIR tool by configuring different reverberation times, room sizes, and sound-source and microphone positions. The ratio of the number of samples containing a reverberation frequency-domain signal to the number of samples without one is 8:2.
Step 102b: Train the neural network model with the pre-constructed training data to obtain the trained neural network model, and determine the trained model as the preset neural network model.
In this embodiment, the pre-built neural network contains a one-dimensional convolution layer, an LSTM layer, a linear layer, and an activation layer. The one-dimensional convolution layer is configured as in_channels=64, out_channels=128, kernel_size=4, stride=1, padding=1, i.e., 64 input channels, 128 output channels, a kernel of 4, a stride of 1, and padding of 1. The LSTM (long short-term memory) layer is configured as input_size=128, hidden_size=64, num_layers=1, i.e., 128 input features, 64 hidden features, and one layer. The linear layer is configured as input_size=64, out_size=64, i.e., 64 input features and 64 output features. The activation layer uses the sigmoid function, an S-shaped function common in biology, also known as the S-shaped growth curve; in information science, because it is monotonically increasing and its inverse is monotonically increasing, the sigmoid is often used as a neural-network activation to map variables into the range 0 to 1. A loss function must also be defined: each training sample is passed through the network to produce a value, and the squared difference between this value and the desired target gives the distance between the predicted value and the true value; training the network means shrinking this distance, i.e., the loss. The model is trained on the pre-constructed data to obtain the trained neural network model, which is determined as the preset neural network model. Training makes the model's output better match reality.
Fig. 7 is a schematic structural diagram of the dereverberation apparatus provided in an embodiment of the present application. As shown in Fig. 7, the dereverberation apparatus 200 provided in this embodiment includes a determination unit 201, a processing unit 202 and a filtering unit 203.
The determination unit 201 is configured to acquire a to-be-processed speech frequency-domain signal, the speech frequency-domain signal being a speech frequency-domain signal containing a reverberation frequency-domain signal, and to determine the corresponding speech frequency-domain feature signal from the speech frequency-domain signal, the speech frequency-domain feature signal being a speech frequency-domain feature signal containing a reverberation frequency-domain signal. The processing unit 202 is configured to input the speech frequency-domain feature signal into the preset neural network model and output the reverberation-suppressed speech frequency-domain feature signal. The determination unit 201 is further configured to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal. The filtering unit 203 is configured to filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain the dereverberated speech frequency-domain signal.
Optionally, the determination unit is further configured to obtain the corresponding gain speech frequency-domain feature signal from the reverberation-suppressed speech frequency-domain feature signal, and to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal and the corresponding gain speech frequency-domain feature signal.
Optionally, the determination unit is further configured to compute the difference between the preset gain signal and the gain speech frequency-domain feature signal to obtain a first speech frequency-domain signal, and to multiply the reverberation-suppressed speech frequency-domain feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
Optionally, the filtering unit is further configured to use the normalized least-mean-square algorithm to determine the calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and to determine the dereverberated speech frequency-domain signal from the corresponding calibrated reverberation frequency-domain signal and the speech frequency-domain signal.
Optionally, the dereverberation apparatus further includes a recognition unit.
The recognition unit is configured to convert the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal, and to perform speech recognition processing on the dereverberated speech time-domain signal.
Optionally, the determination unit is further configured to perform Bark-domain feature extraction on the speech frequency-domain signal to obtain the corresponding speech frequency-domain Bark-domain feature signal, and to determine the corresponding speech frequency-domain Bark-domain feature signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
Optionally, the dereverberation apparatus further includes an acquisition unit.
The acquisition unit is configured to acquire the speech time-domain signal containing the reverberation time-domain signal input through the microphone; to sample the input speech time-domain signal containing the reverberation time-domain signal according to the preset sampling strategy to obtain the to-be-processed speech time-domain signal containing the reverberation time-domain signal; and to perform a Fourier transform on the to-be-processed speech time-domain signal containing the reverberation time-domain signal to obtain the to-be-processed speech frequency-domain signal.
Optionally, the processing unit is further configured to acquire the pre-constructed training data, the pre-constructed training data including a plurality of speech frequency-domain signals containing reverberation frequency-domain signals and a plurality of speech frequency-domain signals without reverberation frequency-domain signals; and to train the neural network model with the pre-constructed training data to obtain the trained neural network model, the trained neural network model being determined as the preset neural network model.
Fig. 8 is a first block diagram of an electronic device for implementing the dereverberation method of an embodiment of the present application. As shown in Fig. 8, the electronic device 300 includes a memory 301 and a processor 302.
The memory 301 stores computer-executable instructions;
the processor 302 executes the computer-executable instructions stored in the memory 301, causing the processor to perform the method provided in any one of the above embodiments.
Fig. 9 is a second block diagram of an electronic device for implementing the dereverberation method of an embodiment of the present application. As shown in Fig. 9, the electronic device may be a computer, a digital broadcast terminal, a messaging device, a tablet device, a personal digital assistant, a server, a server cluster, or the like.
The electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation on the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phone-book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The power component 806 provides power to the various components of the electronic device 800. It may include a power-management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid-crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera; when the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focal-length and optical-zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operating mode, such as a call mode, a recording mode or a speech-recognition mode. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect the open/closed state of the electronic device 800 and the relative positioning of components (for example, the display and the keypad of the electronic device 800); it may also detect a change in position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast-management system via a broadcast channel. In one exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication; for example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, which are executable by the processor 820 of the electronic device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the method of any one of the above embodiments.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the method of any one of the above embodiments.
Those skilled in the art will readily conceive of other embodiments of the present application after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses or adaptations that follow the general principles of the present application and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.
It should be understood that the present application is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (18)

  1. A dereverberation method, comprising:
    acquiring a to-be-processed speech frequency-domain signal, the speech frequency-domain signal being a speech frequency-domain signal containing a reverberation frequency-domain signal, and determining a corresponding speech frequency-domain feature signal from the speech frequency-domain signal, the speech frequency-domain feature signal being a speech frequency-domain feature signal containing a reverberation frequency-domain signal;
    inputting the speech frequency-domain feature signal into a preset neural network model and outputting a reverberation-suppressed speech frequency-domain feature signal;
    determining a corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
    filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  2. The method according to claim 1, wherein determining the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal comprises:
    obtaining a corresponding gain speech frequency-domain feature signal from the reverberation-suppressed speech frequency-domain feature signal; and
    determining the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal and the corresponding gain speech frequency-domain feature signal.
  3. The method according to claim 2, wherein determining the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal and the corresponding gain speech frequency-domain feature signal comprises:
    computing the difference between a preset gain signal and the gain speech frequency-domain feature signal to obtain a first speech frequency-domain signal; and
    multiplying the reverberation-suppressed speech frequency-domain feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
  4. The method according to claim 1, wherein filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain the dereverberated speech frequency-domain signal comprises:
    using a normalized least-mean-square algorithm to determine a calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and determining the dereverberated speech frequency-domain signal from the corresponding calibrated reverberation frequency-domain signal and the speech frequency-domain signal.
  5. The method according to claim 1, further comprising, after filtering the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain the dereverberated speech frequency-domain signal:
    converting the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal, and performing speech recognition processing on the dereverberated speech time-domain signal.
  6. The method according to claim 1, wherein determining the corresponding speech frequency-domain feature signal from the speech frequency-domain signal comprises:
    performing Bark-domain feature extraction on the speech frequency-domain signal to obtain a corresponding speech frequency-domain Bark-domain feature signal; and
    determining the corresponding speech frequency-domain Bark-domain feature signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
  7. The method according to any one of claims 1 to 6, further comprising, before acquiring the to-be-processed speech frequency-domain signal:
    acquiring a speech time-domain signal containing a reverberation time-domain signal input through a microphone;
    sampling the input speech time-domain signal containing the reverberation time-domain signal according to a preset sampling strategy to obtain a to-be-processed speech time-domain signal containing the reverberation time-domain signal; and
    performing a Fourier transform on the to-be-processed speech time-domain signal containing the reverberation time-domain signal to obtain the to-be-processed speech frequency-domain signal.
  8. The method according to any one of claims 1 to 6, further comprising, before inputting the speech frequency-domain feature signal into the preset neural network model and outputting the reverberation-suppressed speech frequency-domain feature signal:
    acquiring pre-constructed training data, the pre-constructed training data comprising: a plurality of speech frequency-domain signals containing reverberation frequency-domain signals and a plurality of speech frequency-domain signals without reverberation frequency-domain signals; and
    training a neural network model with the pre-constructed training data to obtain a trained neural network model, and determining the trained neural network model as the preset neural network model.
  9. A dereverberation apparatus, comprising:
    a determination unit configured to acquire a to-be-processed speech frequency-domain signal, the speech frequency-domain signal being a speech frequency-domain signal containing a reverberation frequency-domain signal, and to determine a corresponding speech frequency-domain feature signal from the speech frequency-domain signal, the speech frequency-domain feature signal being a speech frequency-domain feature signal containing a reverberation frequency-domain signal;
    a processing unit configured to input the speech frequency-domain feature signal into a preset neural network model and output a reverberation-suppressed speech frequency-domain feature signal;
    the determination unit being further configured to determine a corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal; and
    a filtering unit configured to filter the speech frequency-domain signal based on the estimated reverberation frequency-domain signal to obtain a dereverberated speech frequency-domain signal.
  10. The apparatus according to claim 9, wherein the determination unit is further configured to obtain a corresponding gain speech frequency-domain feature signal from the reverberation-suppressed speech frequency-domain feature signal, and to determine the corresponding estimated reverberation frequency-domain signal from the reverberation-suppressed speech frequency-domain feature signal and the corresponding gain speech frequency-domain feature signal.
  11. The apparatus according to claim 10, wherein the determination unit is further configured to compute the difference between a preset gain signal and the gain speech frequency-domain feature signal to obtain a first speech frequency-domain signal, and to multiply the reverberation-suppressed speech frequency-domain feature signal by the first speech frequency-domain signal to obtain the corresponding estimated reverberation frequency-domain signal.
  12. The apparatus according to claim 9, wherein the filtering unit is further configured to use a normalized least-mean-square algorithm to determine the calibrated reverberation frequency-domain signal corresponding to the estimated reverberation frequency-domain signal, and to determine the dereverberated speech frequency-domain signal from the corresponding calibrated reverberation frequency-domain signal and the speech frequency-domain signal.
  13. The apparatus according to claim 9, wherein the apparatus further comprises a recognition unit;
    the recognition unit is configured to convert the dereverberated speech frequency-domain signal into a dereverberated speech time-domain signal, and to perform speech recognition processing on the dereverberated speech time-domain signal.
  14. The apparatus according to claim 9, wherein the determination unit is further configured to perform Bark-domain feature extraction on the speech frequency-domain signal to obtain the corresponding speech frequency-domain Bark-domain feature signal, and to determine the corresponding speech frequency-domain Bark-domain feature signal as the corresponding speech frequency-domain feature signal containing the reverberation frequency-domain signal.
  15. The apparatus according to any one of claims 9 to 14, wherein the apparatus further comprises an acquisition unit;
    the acquisition unit is configured to acquire a speech time-domain signal containing a reverberation time-domain signal input through a microphone; to sample the input speech time-domain signal containing the reverberation time-domain signal according to a preset sampling strategy to obtain a to-be-processed speech time-domain signal containing the reverberation time-domain signal; and to perform a Fourier transform on the to-be-processed speech time-domain signal containing the reverberation time-domain signal to obtain the to-be-processed speech frequency-domain signal.
  16. The apparatus according to any one of claims 9 to 14, wherein the processing unit is further configured to acquire pre-constructed training data, the pre-constructed training data comprising: a plurality of speech frequency-domain signals containing reverberation frequency-domain signals and a plurality of speech frequency-domain signals without reverberation frequency-domain signals; and to train a neural network model with the pre-constructed training data to obtain a trained neural network model, the trained neural network model being determined as the preset neural network model.
  17. An electronic device, comprising: a processor, and a memory communicatively connected to the processor;
    the memory storing computer-executable instructions; and
    the processor executing the computer-executable instructions stored in the memory, causing the processor to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 8.
PCT/CN2022/128051 2022-06-14 2022-10-27 Dereverberation method, apparatus, device and storage medium WO2023240887A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210664442.0 2022-06-14
CN202210664442.0A CN117275500A (zh) 2022-06-14 2022-06-14 Dereverberation method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023240887A1 (zh)

Family

ID=89193107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/128051 WO2023240887A1 (zh) 2022-06-14 2022-10-27 去混响方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN117275500A (zh)
WO (1) WO2023240887A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160035367A1 (en) * 2013-04-10 2016-02-04 Dolby Laboratories Licensing Corporation Speech dereverberation methods, devices and systems
CN107302737A (zh) * 2016-04-14 2017-10-27 Harman International Industries Neural-network-based loudspeaker modeling using a deconvolution filter
CN113823304A (zh) * 2021-07-12 2021-12-21 Tencent Technology (Shenzhen) Co., Ltd. Speech signal processing method and apparatus, electronic device, and readable storage medium
CN114242100A (zh) * 2021-12-16 2022-03-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Audio signal processing method, training method, and apparatus, device and storage medium therefor
CN114495960A (zh) * 2021-12-25 2022-05-13 Zhejiang Dahua Technology Co., Ltd. Audio noise-reduction filtering method, noise-reduction filtering apparatus, electronic device and storage medium


Also Published As

Publication number Publication date
CN117275500A (zh) 2023-12-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 22946555
Country of ref document: EP
Kind code of ref document: A1