WO2022068440A1 - Howling suppression method, apparatus, computer device and storage medium - Google Patents

Howling suppression method, apparatus, computer device and storage medium

Info

Publication number
WO2022068440A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
current
subband
signal
howling
Application number
PCT/CN2021/112769
Other languages
English (en)
French (fr)
Inventor
高毅
罗程
李斌
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21874102.3A (EP4131254A4)
Publication of WO2022068440A1
Priority to US17/977,380 (US20230046518A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/018 Audio watermarking, i.e. embedding inaudible data in the audio signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/02 Circuits for transducers, loudspeakers or microphones for preventing acoustic reaction, i.e. acoustic oscillatory feedback

Definitions

  • the present application relates to the field of computer technology, and in particular, to a howling suppression method, apparatus, computer device and storage medium.
  • Voice calls can be made over a network, for example voice calls in various instant messaging applications.
  • Two or more voice call devices are often located close to each other, for example in the same room. In this case howling is very likely to occur, which affects the quality of the voice call.
  • Conventionally, howling is avoided by adjusting the distance between the voice call devices. However, when the distance cannot be adjusted, howling is generated, which reduces the quality of the voice call.
  • a howling suppression method, apparatus, computer device, and storage medium are provided.
  • a howling suppression method, executed by a computer device, includes:
  • the frequency domain audio signal is divided to obtain subbands, and a target subband is determined from the subbands;
  • howling suppression is performed on the target subband based on the current subband gain, and the first target audio signal corresponding to the current time period is obtained.
  • a howling suppression device includes:
  • a signal transformation module configured to obtain the current audio signal corresponding to the current time period and perform frequency domain transformation on the current audio signal to obtain the frequency domain audio signal;
  • a subband determination module configured to divide the frequency domain audio signal to obtain subbands and determine the target subband from the subbands;
  • a coefficient determination module configured to obtain the current howling detection result and the current voice detection result corresponding to the current audio signal, and determine the subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
  • a gain determination module configured to obtain the historical subband gain corresponding to the audio signal of the historical time period, and calculate the current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and
  • a howling suppression module configured to perform howling suppression on the target subband based on the current subband gain to obtain the first target audio signal corresponding to the current time period.
  • a computer device includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to implement the following steps:
  • the frequency domain audio signal is divided to obtain subbands, and a target subband is determined from the subbands;
  • howling suppression is performed on the target subband based on the current subband gain, and the first target audio signal corresponding to the current time period is obtained.
  • one or more non-volatile storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the frequency domain audio signal is divided to obtain subbands, and a target subband is determined from the subbands;
  • howling suppression is performed on the target subband based on the current subband gain, and the first target audio signal corresponding to the current time period is obtained.
  • FIG. 1 is an application environment diagram of a howling suppression method in one embodiment
  • FIG. 2 is a schematic flowchart of a howling suppression method in one embodiment
  • FIG. 2a is a schematic diagram of the relationship between the frequency and energy of an audio signal in a specific embodiment
  • FIG. 3 is a schematic flowchart of obtaining a current audio signal in one embodiment
  • FIG. 5 is a schematic flowchart of obtaining a current audio signal in another embodiment
  • FIG. 6 is a schematic flowchart of obtaining a current audio signal in another embodiment
  • FIG. 7 is a schematic flowchart of obtaining subband gain coefficients in one embodiment
  • FIG. 8 is a schematic flowchart of obtaining a second target audio signal in one embodiment
  • FIG. 8a is a schematic diagram of an energy constraint curve in a specific embodiment
  • FIG. 9 is a schematic flowchart of a howling suppression method in a specific embodiment
  • FIG. 10 is a schematic diagram of an application scenario of a howling suppression method in a specific embodiment
  • FIG. 11 is a schematic diagram of an application framework of a howling suppression method in a specific embodiment
  • FIG. 12 is a schematic flowchart of a howling suppression method in a specific embodiment
  • FIG. 13 is a schematic diagram of an application framework of a howling suppression method in another specific embodiment
  • FIG. 14 is a schematic diagram of an application framework of a howling suppression method in another specific embodiment
  • FIG. 15 is a structural block diagram of a howling suppression apparatus in one embodiment
  • FIG. 16 is a diagram of the internal structure of a computer device in one embodiment.
  • the howling suppression method provided by the embodiment of the present application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 106 through the network
  • the terminal 104 communicates with the server 106 through the network
  • the terminal 102 and the terminal 104 communicate through the server 106.
  • The distance between the terminal 102 and the terminal 104 is relatively close; for example, they are in the same room.
  • the terminal 102 and the terminal 104 may be either a sending terminal for sending voice or a receiving terminal for receiving voice.
  • The terminal 102 or the terminal 104 obtains the current audio signal corresponding to the current time period and performs frequency domain transformation on the current audio signal to obtain the frequency domain audio signal, divides the frequency domain audio signal into subbands, and determines the target subband from the subbands. The terminal 102 or the terminal 104 obtains the current howling detection result and the current voice detection result corresponding to the current audio signal, and determines the subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result. The terminal 102 or the terminal 104 obtains the historical subband gain corresponding to the audio signal of the historical time period, and calculates the current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain. The terminal 102 or the terminal 104 then performs howling suppression on the target subband based on the current subband gain to obtain the first target audio signal corresponding to the current time period.
  • The terminal can be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server can be an independent server or a server cluster.
  • A howling suppression method is provided, described here as applied to the terminal in FIG. 1 as an example. It can be understood that the method can also be applied to a server, or to a system including a terminal and a server, in which case it is realized through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
  • Step 202 Acquire a current audio signal corresponding to the current time period, and perform frequency domain transformation on the current audio signal to obtain a frequency domain audio signal.
  • An audio signal is an information carrier for the frequency and amplitude variations of sound waves carrying speech, music, and sound effects.
  • The current audio signal refers to an audio signal that needs howling suppression, that is, there is a howling signal in the current audio signal. Howling arises when, for example, the sound source and the amplification equipment are too close together and the acoustic energy self-excites.
  • The howling signal refers to the audio signal corresponding to the howling, which is often sharp and harsh.
  • the current audio signal may be an audio signal that needs howling suppression obtained after collecting the audio signal through a microphone or other acquisition device and performing signal processing.
  • the signal processing may include echo cancellation, noise suppression, and howling detection.
  • Echo cancellation refers to eliminating the echo that is formed when sound from a playback device such as a speaker travels back through the air into an acquisition device such as a microphone and interferes with the captured signal.
  • Noise suppression refers to extracting the clean original audio from a noisy signal, that is, obtaining an audio signal without background noise.
  • Howling detection refers to detecting whether there is a howling signal in the audio signal.
  • the current audio signal may also be an audio signal that requires howling suppression and is obtained after receiving the audio signal through the network and processing, and the signal processing may be howling detection.
  • The current time period refers to the time period in which the current audio signal is located, that is, one frame after the audio signal has been divided into speech frames. For example, the length of the current time period may be 10 ms to 30 ms.
  • Frequency domain transformation refers to transforming the current audio signal from the time domain to the frequency domain.
  • the time domain is used to describe the relationship between the audio signal and time.
  • the time domain waveform of the audio signal can express the change of the audio signal over time.
  • The frequency domain is a coordinate system used to describe how the audio signal varies with frequency.
  • a frequency domain plot shows the amount of signal in each given frequency band within a frequency range.
  • the frequency domain representation may also include information about the phase shift of each sinusoid, so that the frequency components can be recombined to recover the original time signal.
  • the frequency domain audio signal refers to the audio signal obtained by transforming the current audio signal from the time domain to the frequency domain.
  • The terminal can collect voice through a microphone or other acquisition device to obtain the audio signal of the current time period, and then perform howling detection on the audio signal. Howling can be detected using parametric criteria such as the peak-to-average power ratio, based on the pitch period of the audio signal, or based on the energy of the audio signal.
  • When there is a howling signal in the audio signal, the terminal obtains the current audio signal corresponding to the current time period, and then transforms the current audio signal into the frequency domain through a Fourier transform to obtain the frequency domain audio signal.
  • Before performing howling detection on the collected audio signal, the terminal may also perform processing such as echo cancellation and noise suppression on the collected audio signal.
  • the terminal can also obtain the voice sent by other voice call terminals through the network, obtain the audio signal of the current time period, and then perform howling detection on the audio signal.
  • When a howling signal is present, the current audio signal corresponding to the current time period is obtained, and the current audio signal is then transformed into the frequency domain through the Fourier transform to obtain the frequency domain audio signal.
  • the terminal may also acquire the audio signal sent by the server, and then perform howling detection on the audio signal, and when there is a howling signal in the audio signal, obtain the current audio signal corresponding to the current time period.
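  • As an illustrative sketch only (not taken from the disclosure; the 16 kHz sample rate, 20 ms frame length, Hann window, and NumPy usage are assumptions), the framing and frequency domain transformation described above could look like this:

```python
import numpy as np

def to_frequency_domain(frame: np.ndarray) -> np.ndarray:
    """Transform one time-domain audio frame (e.g. 10-30 ms of samples)
    into the frequency domain with a windowed FFT."""
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    return np.fft.rfft(windowed)                   # one-sided complex spectrum

# Example: a 20 ms frame at 16 kHz (320 samples) of synthetic audio.
sample_rate = 16000
frame = np.random.randn(int(0.02 * sample_rate))
spectrum = to_frequency_domain(frame)
```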
  • Step 204 Divide the frequency domain audio signal to obtain each subband, and determine the target subband from each subband.
  • A subband refers to one of the frequency bands obtained by dividing the frequency domain audio signal.
  • the target sub-band refers to the sub-band for which howling suppression needs to be performed.
  • The terminal divides the frequency domain audio signal, for example using a bank of band-pass filters, to obtain the subbands. The division can be performed according to a preset number of subbands, according to preset frequency band ranges, and so on. The energy of each subband is then calculated, and the target subband is selected according to the subband energies.
  • There may be a single selected target subband, for example the subband with the maximum energy, or there may be multiple target subbands, for example a preset number of subbands selected in descending order of subband energy.
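  • Continuing the previous sketch under the same assumptions (a uniform split into a fixed number of subbands is an assumption; the disclosure also allows division by preset frequency band ranges), subband energies and a single maximum-energy target subband could be computed as follows:

```python
import numpy as np

def subband_energies(spectrum: np.ndarray, num_subbands: int = 9) -> np.ndarray:
    """Split a one-sided spectrum into equal-width subbands and return each subband's energy."""
    bands = np.array_split(np.abs(spectrum) ** 2, num_subbands)
    return np.array([band.sum() for band in bands])

def select_target_subband(energies: np.ndarray) -> int:
    """Pick the subband with the maximum energy as the target subband."""
    return int(np.argmax(energies))
```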
  • Step 206 Obtain the current howling detection result and the current voice detection result corresponding to the current audio signal, and determine the subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result.
  • the current howling detection result refers to a detection result obtained after howling detection is performed on the current audio signal, which may include the presence of a howling signal in the current audio signal and the absence of a howling signal in the current audio signal.
  • the current voice detection result refers to the detection result obtained after the current audio signal is detected by the voice endpoint, wherein the voice endpoint detection (Voice Activity Detection, VAD) refers to accurately locating the start and end of the voice from the current audio signal.
  • the current voice detection result may include the presence of the voice signal in the current audio signal and the absence of the voice signal in the current audio signal.
  • The subband gain coefficient is used to indicate the degree of howling suppression required for the current audio signal: the smaller the subband gain coefficient, the stronger the howling suppression applied to the current audio signal; the larger the subband gain coefficient, the weaker the howling suppression.
  • The terminal can acquire the current howling detection result and the current voice detection result corresponding to the current audio signal. These results may have been obtained by performing howling detection and voice endpoint detection before howling suppression is applied to the current audio signal, and then saved in memory.
  • the terminal may also obtain the current howling detection result and the current voice detection result corresponding to the current audio signal from a third party, which is a service party that performs howling detection and voice endpoint detection on the current audio signal.
  • the terminal may obtain the current howling detection result and the current voice detection result corresponding to the saved current audio signal from the server.
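  • As a hedged illustration only, the mapping from the two detection results to a subband gain coefficient might be sketched as below; the concrete coefficient values and the decision rule are assumptions, not values from the disclosure:

```python
def subband_gain_coefficient(howling_detected: bool, speech_detected: bool) -> float:
    """Illustrative mapping from detection results to a subband gain coefficient.

    A coefficient below 1 strengthens howling suppression; a coefficient
    above 1 relaxes it. The concrete values are assumptions for illustration.
    """
    if howling_detected and not speech_detected:
        return 0.5   # pure howling: suppress aggressively
    if howling_detected and speech_detected:
        return 0.8   # howling mixed with speech: suppress more gently
    return 1.2       # no howling detected: let the gain recover
```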
  • Step 208 Obtain the historical subband gain corresponding to the audio signal of the historical time period, and calculate the current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain.
  • the historical time period refers to the historical time period corresponding to the current time period, and the time length of the historical time period may be the same as the time length of the current time period, or may be different from the time length of the current time period.
  • the historical time period may be a previous time period of the current time period, or may be multiple time periods before the current time period.
  • the historical time period may have a preset interval with the current time period, or may be directly connected with the current time period. For example, within a time period of 0ms to 100ms, the current time period may be 80ms to 100ms, and the historical time period may be a time period of 60ms to 80ms.
  • the audio signal of the historical time period refers to the audio signal after howling suppression has been performed.
  • the historical subband gain refers to the subband gain used when howling suppression is performed on the audio signal of the historical time period.
  • The current subband gain refers to the subband gain used when howling suppression is performed on the current audio signal.
  • the terminal can obtain the historical subband gain corresponding to the audio signal of the historical time period from the memory, calculate the product of the subband gain coefficient and the historical subband gain, and obtain the current subband gain corresponding to the current audio signal.
  • Initially, the historical subband gain is a preset initial subband gain value; for example, the initial subband gain value can be 1, and an initial subband gain value of 1 means that the current audio signal is not suppressed.
  • When the subband gain coefficient is less than one, it indicates that howling suppression needs to be applied to the current audio signal; when the subband gain coefficient is greater than one, it indicates that howling suppression of the current audio signal can be relaxed.
  • The current subband gain corresponding to the current audio signal is compared with a preset lower limit of the subband gain; when the current subband gain is less than the preset lower limit, the preset lower limit is taken as the current subband gain corresponding to the current audio signal.
  • The current subband gain corresponding to the current audio signal is also compared with the initial subband gain value; when the current subband gain is greater than the initial subband gain value, the initial subband gain value is taken as the current subband gain corresponding to the current audio signal.
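  • Combining the two preceding paragraphs, a minimal sketch of the gain update could be (the lower limit value of 0.1 is an assumed placeholder):

```python
def update_subband_gain(previous_gain: float,
                        gain_coefficient: float,
                        lower_limit: float = 0.1,
                        initial_gain: float = 1.0) -> float:
    """Multiply the historical subband gain by the coefficient, then clamp
    the result between the preset lower limit and the initial gain value."""
    gain = previous_gain * gain_coefficient
    return max(lower_limit, min(gain, initial_gain))

# Example: starting from no suppression, repeated howling drives the gain down.
gain = 1.0
for _ in range(5):
    gain = update_subband_gain(gain, gain_coefficient=0.5)
```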
  • Step 210 Perform howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.
  • the first target audio signal refers to an audio signal obtained after howling suppression is performed on the target subband in the current audio signal.
  • The current subband gain is applied to the spectrum of the target subband, and the gained audio signal is then converted from the frequency domain back to the time domain using the inverse Fourier transform, to obtain the first target audio signal corresponding to the current time period.
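  • Under the same assumed frame layout and uniform subband split as in the earlier sketches, applying the current subband gain to the target subband and returning to the time domain might look like:

```python
import numpy as np

def suppress_target_subband(spectrum: np.ndarray,
                            target_subband: int,
                            subband_gain: float,
                            num_subbands: int = 9) -> np.ndarray:
    """Scale the FFT bins of the target subband by the current subband gain
    and convert the result back to a time-domain frame."""
    spectrum = spectrum.copy()
    bounds = np.linspace(0, len(spectrum), num_subbands + 1, dtype=int)
    lo, hi = bounds[target_subband], bounds[target_subband + 1]
    spectrum[lo:hi] *= subband_gain            # attenuate the howling subband
    return np.fft.irfft(spectrum)              # first target audio frame
```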
  • the current audio signal corresponding to the current time period is collected by a collection device such as a microphone of the terminal, then the first target audio signal corresponding to the current time period may be encoded to obtain an encoded audio signal, Then, the encoded audio signal is sent to other terminals for voice calls through the network interface.
  • the terminal 102 collects audio signals through a microphone, performs echo cancellation and noise suppression, and obtains a current audio signal corresponding to the current time period, and then performs howling suppression on the current audio signal to obtain a corresponding current time period.
  • The first target audio signal corresponding to the current time period is sent to the terminal 104 through the server 106; the terminal 104 receives the first target audio signal corresponding to the current time period, decodes it, and then plays the decoded first target audio signal.
  • The volume of the first target audio signal can also be adjusted; for example, the volume can be increased, the volume-adjusted first target audio signal is then encoded, and the encoded first target audio signal is sent to other voice call terminals through a network interface.
  • the current audio signal corresponding to the current time period is sent by other voice call terminals through a network interface.
  • the first target audio signal corresponding to the current time period can be directly played by voice.
  • the terminal 102 collects the audio signal through a microphone, performs echo cancellation and noise suppression, encodes the audio signal and sends it to the terminal 104 through the server 106, and the terminal 104 receives the encoded audio signal and decodes it.
  • The decoded audio signal is obtained and processed to obtain the current audio signal corresponding to the current time period; howling suppression is then performed on the current audio signal to obtain the first target audio signal corresponding to the current time period, and the first target audio signal is played.
  • As shown in FIG. 2a, which is a schematic diagram of the relationship between the frequency and the energy of the audio signal, the abscissa represents frequency and the ordinate represents energy. Different subbands are obtained by dividing the frequency range; the figure shows 9 subbands, where the higher the subband index, the higher the frequency band, the low frequency subbands are the 1st subband to the 4th subband, and the high frequency subbands are the 5th subband to the 9th subband. The solid line in the figure represents the frequency-energy curve when there is only a speech signal, and the dotted line represents the frequency-energy curve when there are both a speech signal and a howling signal in the audio signal; the energy when both a speech signal and a howling signal are present is significantly higher than when there is only a speech signal. The energy of the 8th subband is the highest, so the 8th subband is determined as the target subband and howling suppression is performed on it. As howling suppression is performed on the 8th subband, its energy gradually decreases until the 6th subband has the maximum subband energy, at which point howling suppression is performed on the 6th subband.
  • In the above howling suppression method, the current audio signal corresponding to the current time period is obtained, the current howling detection result and the current voice detection result corresponding to the current audio signal are obtained, the subband gain coefficient corresponding to the current audio signal is determined according to the current howling detection result and the current voice detection result, and the current subband gain corresponding to the current audio signal is calculated from the subband gain coefficient and the historical subband gain, so that the obtained current subband gain is more accurate. The current subband gain is then used to perform howling suppression on the target subband, so that howling can be suppressed accurately, the quality of the obtained first target audio signal corresponding to the current time period is improved, and the quality of voice calls is improved.
  • In one embodiment, step 202 of acquiring the current audio signal corresponding to the current time period includes:
  • Step 302 Collect the initial audio signal corresponding to the current time period, perform echo cancellation on the initial audio signal, and obtain the initial audio signal after echo cancellation.
  • the initial audio signal refers to a digital audio signal converted after the user's voice is collected by a collection device such as a microphone.
  • The terminal collects the initial audio signal corresponding to the current time period and uses an echo cancellation algorithm to perform echo cancellation on the initial audio signal, obtaining the initial audio signal after echo cancellation. In echo cancellation, an adaptive algorithm estimates the expected signal, which approximates the echo signal passing through the actual echo path, i.e. a simulated echo signal; the simulated echo is then subtracted from the initial audio signal collected by the microphone or other acquisition equipment to obtain the initial audio signal after echo cancellation.
  • The echo cancellation algorithm includes at least one of the LMS (Least Mean Square adaptive filtering) algorithm, the RLS (Recursive Least Squares adaptive filtering) algorithm, and the APA (Affine Projection adaptive filtering) algorithm.
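  • As a minimal, non-authoritative sketch of the LMS algorithm named above (the filter length and step size are assumptions), the adaptive filter estimates the echo from the far-end loudspeaker signal and subtracts it from the microphone signal:

```python
import numpy as np

def lms_echo_cancel(mic: np.ndarray, far_end: np.ndarray,
                    num_taps: int = 64, mu: float = 0.01) -> np.ndarray:
    """Basic LMS adaptive filter: model the echo path from the far-end
    (loudspeaker) signal and subtract the simulated echo from the mic signal."""
    weights = np.zeros(num_taps)
    output = np.zeros(len(mic))
    for n in range(num_taps, len(mic)):
        x = far_end[n - num_taps:n][::-1]   # most recent far-end samples
        echo_estimate = weights @ x
        error = mic[n] - echo_estimate      # echo-cancelled sample
        weights += 2 * mu * error * x       # LMS weight update
        output[n] = error
    return output
```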
  • Step 304 Perform voice endpoint detection on the initial audio signal after echo cancellation to obtain a current voice detection result.
  • Voice endpoint detection algorithms include dual-threshold detection methods, energy-based endpoint detection algorithms, endpoint detection algorithms based on cepstral coefficients, endpoint detection algorithms based on frequency band variance, endpoint detection algorithms based on autocorrelation similarity distance, information entropy-based endpoint detection algorithms, and more.
  • Step 306 Perform noise suppression on the echo-cancelled initial audio signal based on the current speech detection result to obtain an initial audio signal after noise suppression.
  • the current speech detection result is that the initial audio signal after echo cancellation does not contain a speech signal
  • Noise suppression can be done using a neural network model trained for noise removal, or using filters.
  • noise suppression is performed while retaining the voice signal as much as possible to obtain an initial audio signal after noise suppression.
  • the voice signal refers to a signal corresponding to the user's voice.
  • In step 308, howling detection is performed on the initial audio signal after noise suppression to obtain a current howling detection result.
  • The terminal uses a howling detection algorithm to perform howling detection on the noise-suppressed initial audio signal to obtain the current howling detection result. The howling detection algorithm may be a detection algorithm based on energy distribution, such as a peak-to-harmonic power ratio algorithm, a peak-to-average ratio algorithm, or an inter-frame peak hold algorithm, or it may be a detection algorithm based on a neural network, and so on.
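  • Purely as an illustration of one energy-distribution criterion of the kind mentioned above (the 15 dB threshold is an assumption), a peak-to-average power ratio check on the magnitude spectrum could be sketched as:

```python
import numpy as np

def howling_detected(spectrum: np.ndarray, threshold_db: float = 15.0) -> bool:
    """Flag likely howling when one frequency bin dominates the spectrum,
    i.e. the peak-to-average power ratio exceeds a threshold (in dB)."""
    power = np.abs(spectrum) ** 2
    ratio_db = 10 * np.log10(power.max() / (power.mean() + 1e-12))
    return ratio_db > threshold_db
```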
  • Step 310 when the current howling detection result is that there is a howling signal in the initial audio signal after noise suppression, the initial audio signal after noise suppression is used as the current audio signal corresponding to the current time period.
  • When the terminal detects that there is a howling signal in the initial audio signal after noise suppression, the initial audio signal after noise suppression is used as the current audio signal corresponding to the current time period, and howling suppression is then performed on the current audio signal.
  • In the above embodiment, echo cancellation is performed on the collected initial audio signal, voice endpoint detection is performed on the initial audio signal after echo cancellation, noise suppression is performed based on the current voice detection result, and howling detection is performed on the noise-suppressed initial audio signal. When a howling signal is detected, the initial audio signal after noise suppression is used as the current audio signal corresponding to the current time period, which ensures that the obtained current audio signal is an audio signal that requires howling suppression.
  • In one embodiment, step 304 of performing voice endpoint detection on the initial audio signal after echo cancellation to obtain the current voice detection result includes:
  • the initial audio signal after echo cancellation is input into the speech endpoint detection model for detection, and the current speech detection result is obtained.
  • the speech endpoint detection model is obtained by training the neural network algorithm based on the training audio signal and the corresponding training speech detection result.
  • The neural network algorithm can be a BP (back propagation feedforward neural network) algorithm, an LSTM (Long Short-Term Memory) algorithm, an RNN (Recurrent Neural Network) algorithm, and so on.
  • The training audio signal refers to the audio signal used when training the voice endpoint detection model, and the training voice detection result refers to the voice detection result corresponding to the training audio signal; the training voice detection result includes that the training audio signal contains a voice signal and that the training audio signal does not contain a voice signal. During training, the loss function is the cross-entropy loss function optimized by gradient descent, and the activation function is the sigmoid function.
  • The terminal uses wavelet analysis to extract audio features from the initial audio signal after echo cancellation; the audio features include the short-term zero-crossing rate, short-term energy, short-term amplitude spectrum kurtosis, short-term amplitude spectrum skewness, and so on. The audio features are input into the speech endpoint detection model for detection, and the output current speech detection result is obtained.
  • the current speech detection result includes that the initial audio signal after echo cancellation includes a speech signal and the initial audio signal after echo cancellation does not include a speech signal.
  • the speech endpoint detection model is obtained by using a neural network algorithm to train based on the training audio signal and the corresponding training speech detection results.
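  • As a hedged sketch only (the feature dimension, layer sizes, and the use of PyTorch are assumptions not found in the disclosure), a minimal voice/no-voice classifier with a sigmoid output trained with cross-entropy loss and gradient descent could be set up as follows:

```python
import torch
from torch import nn

# Tiny feed-forward VAD classifier: frame-level features in, speech probability out.
model = nn.Sequential(
    nn.Linear(4, 16),    # 4 assumed features: zero-crossing rate, energy, kurtosis, skewness
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),        # sigmoid activation, as described above
)
loss_fn = nn.BCELoss()   # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent

features = torch.randn(8, 4)                    # a batch of 8 feature vectors (placeholder data)
labels = torch.randint(0, 2, (8, 1)).float()    # 1 = contains speech, 0 = no speech
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
```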
  • In one embodiment, step 304 of performing voice endpoint detection on the initial audio signal after echo cancellation to obtain the current voice detection result includes the following:
  • Low-pass filtering refers to a filtering method in which low-frequency signals pass normally while high-frequency signals above a set cutoff are blocked or attenuated.
  • Signal energy refers to the short-term energy corresponding to low-frequency signals.
  • Energy fluctuation refers to the signal energy ratio between the low-frequency signal of the previous frame and the low-frequency signal of the next frame.
  • the terminal performs low-pass filtering on the initial audio signal after echo cancellation according to the preset low frequency value to obtain the low frequency signal, and the preset low frequency value may be 500 Hz.
  • the signal energy corresponding to each frame in the low-frequency signal is then calculated, and triangular filtering can be used to calculate the signal energy.
  • When the ratio exceeds the preset energy ratio, it means that the initial audio signal after echo cancellation contains a speech signal; when the ratio does not exceed the preset energy ratio, it means that the initial audio signal after echo cancellation does not contain a speech signal. The current speech detection result is obtained in this way.
  • the low-frequency signal is obtained by low-pass filtering the initial audio signal after echo cancellation, and then the current speech detection result is determined according to the energy fluctuation of the low-frequency signal, which can make the obtained current speech detection result more accurate.
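  • The following sketch is illustrative only; the 500 Hz cutoff comes from the text above, while the Butterworth filter, frame size, and ratio threshold are assumptions:

```python
import numpy as np
from scipy.signal import butter, lfilter

def energy_fluctuation_vad(audio: np.ndarray, sample_rate: int = 16000,
                           cutoff_hz: float = 500.0, frame_len: int = 320,
                           ratio_threshold: float = 2.0) -> bool:
    """Low-pass filter the signal, compute per-frame energy, and report speech
    when the energy ratio between consecutive frames fluctuates strongly."""
    b, a = butter(4, cutoff_hz / (sample_rate / 2), btype="low")
    low = lfilter(b, a, audio)
    frames = low[: len(low) // frame_len * frame_len].reshape(-1, frame_len)
    energy = (frames ** 2).sum(axis=1) + 1e-12
    ratios = energy[1:] / energy[:-1]
    return bool(np.any(ratios > ratio_threshold))
```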
  • In one embodiment, step 304 of performing voice endpoint detection on the initial audio signal after echo cancellation to obtain the current voice detection result includes the following steps:
  • In general, a sound is composed of a series of vibrations with different frequencies and amplitudes emitted by the sounding body.
  • One of these vibrations has the lowest frequency, and the sound produced by it is the fundamental sound, and the rest are overtones.
  • Pitch detection refers to the estimation of the pitch period, which is used to detect the trajectory curve that is exactly the same or as close as possible to the vibration frequency of the vocal cords.
  • the pitch period is the time each time the vocal cords open and close.
  • The terminal performs low-pass filtering on the initial audio signal after echo cancellation to obtain a low-frequency signal, and uses a pitch detection algorithm to perform pitch detection on the low-frequency signal to obtain the pitch period. The pitch detection algorithm may include the autocorrelation method, the average magnitude difference function method, the parallel processing method, the cepstrum method, the simplified inverse filtering method, and so on.
  • Based on the pitch period, it is determined whether the initial audio signal after echo cancellation contains a voice signal: if a pitch period can be detected, the initial audio signal after echo cancellation contains a voice signal; if no pitch period can be detected, it does not contain a voice signal. The current speech detection result is obtained in this way.
  • the current speech detection result is obtained by detecting the pitch period, which improves the accuracy of obtaining the current speech detection result.
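  • As a rough illustration of the autocorrelation method named above (the lag search range and the peak-strength check are assumptions), pitch-period detection could be sketched as:

```python
import numpy as np
from typing import Optional

def detect_pitch_period(frame: np.ndarray, sample_rate: int = 16000,
                        fmin: float = 60.0, fmax: float = 400.0) -> Optional[int]:
    """Estimate the pitch period (in samples) of a frame via autocorrelation.
    Returns None when no clear periodicity is found (i.e. likely no voice)."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    if hi >= len(corr) or corr[0] <= 0:
        return None
    lag = lo + int(np.argmax(corr[lo:hi]))
    # Require a reasonably strong peak relative to the zero-lag energy.
    return lag if corr[lag] > 0.3 * corr[0] else None
```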
  • In one embodiment, step 308 of performing howling detection on the initial audio signal after noise suppression to obtain the current howling detection result includes the following steps:
  • The howling detection model is obtained by training a neural network algorithm based on the howling training audio signal and the corresponding training howling detection result.
  • The neural network algorithm can be a BP (back propagation feedforward neural network) algorithm, an LSTM (Long Short-Term Memory) algorithm, an RNN (Recurrent Neural Network) algorithm, and so on.
  • Howling training audio signal refers to the audio signal used when training the howling detection model.
  • The training howling detection result refers to the howling detection result corresponding to the howling training audio signal, including that the initial audio signal after noise suppression contains a howling signal and that the initial audio signal after noise suppression does not contain a howling signal.
  • The terminal can extract audio features corresponding to the initial audio signal after noise suppression; the audio features include MFCC (Mel-frequency cepstral coefficient) dynamic features, band representative vectors, and various types of audio fingerprints. The Mel-frequency cepstral coefficients are the coefficients that make up the Mel-frequency cepstrum.
  • An audio fingerprint is obtained by extracting, with a specific algorithm, the digital features of the noise-suppressed initial audio signal in the form of identifiers, and a band representative vector is an ordered index list of prominent tones in the frequency band.
  • the terminal inputs the extracted audio features into the howling detection model for detection, and obtains the current howling detection result.
  • In one embodiment, step 308 of performing howling detection on the initial audio signal after noise suppression to obtain the current howling detection result includes:
  • Step 402 Extract the initial audio feature corresponding to the initial audio signal after noise suppression.
  • The initial audio features refer to audio features extracted from the initial audio signal after noise suppression, and include at least one of MFCC (Mel-frequency cepstral coefficient) dynamic features, band representative vectors, and various types of audio fingerprints.
  • the terminal can also select the corresponding audio features according to the accuracy and the amount of calculation.
  • the frequency band representation vector and various types of audio fingerprints are used as the initial audio features.
  • dynamic features of mel-frequency cepstral coefficients, frequency band representation vectors, and various types of audio fingerprints can be used as initial audio features.
  • The terminal extracts the initial audio features corresponding to the initial audio signal after noise suppression. For example, to extract the Mel-frequency cepstral coefficient dynamic features, the noise-suppressed initial audio signal is pre-emphasized and divided into frames, each frame is windowed, a fast Fourier transform is applied to the windowed result, the logarithmic energy of the transformed result is computed through triangular (Mel) filtering, and the Mel-frequency cepstral coefficient dynamic features are obtained after a discrete cosine transform.
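  • Illustratively (librosa and its default Mel filterbank parameters are assumptions; the disclosure only names the generic steps above), the MFCC dynamic features could be computed as:

```python
import numpy as np
import librosa

def mfcc_features(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Compute MFCCs plus their first-order deltas as simple 'dynamic' features."""
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])  # pre-emphasis
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)       # dynamic (delta) coefficients
    return np.vstack([mfcc, delta])
```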
  • Step 404 Acquire a first historical audio signal corresponding to the first historical time period, and extract a first historical audio feature corresponding to the first historical audio signal.
  • the first historical time period refers to a time period before the current time period, and has the same time length as the current time period, and there may be multiple first historical time periods. For example, if the current call is 2500ms, the length of the current time period is 300ms, that is, the current time period is 2200ms to 2500ms, and the preset interval is 20ms, then the first historical time period can be 200ms ⁇ 500ms, 220ms ⁇ 520ms, 240ms ⁇ 540ms, ..., 1980ms ⁇ 2280ms and 2000 ⁇ 2300ms.
  • the first historical audio signal refers to a historical audio signal corresponding to the first historical time period, and is an audio signal collected by a microphone in the first historical time period.
  • the first historical audio feature refers to the audio feature corresponding to the first historical audio signal, which may include Mel-Frequency cepstrum coefficients (MFCC, Mel-Frequency cepstrum coefficients) dynamic features, band representative vectors, and various types of At least one of audio fingerprints.
  • the terminal may acquire the first historical audio signal corresponding to the first historical time period from the cache, or may download the first historical audio signal corresponding to the first historical time period from the server. Then, the first historical audio feature corresponding to the first historical audio signal is extracted.
  • Step 406 Calculate the first similarity between the initial audio feature and the first historical audio feature, and determine the current howling detection result based on the first similarity.
  • the first similarity refers to the similarity between the initial audio feature and the first historical audio feature, and the similarity may be distance similarity or cosine similarity.
  • the terminal can use the similarity algorithm to calculate the first similarity between the initial audio feature and the first historical audio feature.
  • When the first similarity exceeds the preset first similarity threshold, it indicates that there is a howling signal in the initial audio signal after noise suppression; if the first similarity does not exceed the preset first similarity threshold, it means that there is no howling signal in the initial audio signal after noise suppression. The current howling detection result is obtained in this way.
  • Multiple first historical audio signals may be acquired, the first historical audio feature corresponding to each first historical audio signal is extracted, the first similarity between each first historical audio feature and the initial audio feature is calculated, and the duration for which the first similarity exceeds the preset first similarity threshold is counted. When the duration exceeds a preset duration, it indicates that there is a howling signal in the noise-suppressed initial audio signal; when the duration does not exceed the preset duration, it means that there is no howling signal in the initial audio signal after noise suppression. The current howling detection result is obtained in this way.
  • In this embodiment, the current howling detection result is determined from the similarity between the initial audio feature and the historical audio features, which makes the obtained current howling detection result more accurate.
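  • The following is a non-authoritative sketch of this similarity check; the use of cosine similarity, the 0.9 threshold, and the minimum count of similar frames are assumptions standing in for the preset similarity threshold and preset duration:

```python
import numpy as np
from typing import List

def howling_by_similarity(current_feature: np.ndarray,
                          historical_features: List[np.ndarray],
                          similarity_threshold: float = 0.9,
                          min_similar_frames: int = 3) -> bool:
    """Compare the current frame's feature vector against several historical
    frames; persistent high similarity suggests a sustained howling tone."""
    similar = 0
    for hist in historical_features:
        cos = float(current_feature @ hist /
                    (np.linalg.norm(current_feature) * np.linalg.norm(hist) + 1e-12))
        if cos > similarity_threshold:
            similar += 1
    return similar >= min_similar_frames
```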
  • the howling suppression method further includes the steps of:
  • When the current howling detection result is that there is a howling signal in the current audio signal, the to-be-played audio signal and the preset audio watermark signal are acquired, and the preset audio watermark signal is added to the to-be-played audio signal, which is then played.
  • The audio signal to be played refers to the audio signal that the terminal will play through the playback device while the user is speaking (for example, a background sound).
  • The preset audio watermark signal refers to a preset audio signal used to indicate that a howling signal exists in the audio signal sent through the network, and is a signal that is not easily perceptible to the human ear. The preset audio watermark signal can be a high-frequency watermark signal selected from the ultrasonic band.
  • If howling suppression were performed directly at the sending terminal, the audio signals received by all receiving terminals would be howling-suppressed audio signals, which would affect the audio signal quality of all receiving terminals.
  • Therefore, when the sending terminal detects that there is a howling signal in the current audio signal, it does not perform howling suppression; it obtains the audio signal to be played and the preset audio watermark signal, adds the preset audio watermark signal to the audio signal to be played and plays it, and then sends the current audio signal, without howling suppression, directly to all receiving terminals through the network.
  • a single-frequency tone or multi-frequency tone of a preset frequency may be embedded in the high frequency band of the audio signal to be played, as a preset high frequency watermark signal.
  • multiple preset high-frequency watermark signals may be embedded in the audio signal to be played for playback.
  • a time-domain audio watermarking algorithm can also be used to add a preset audio watermark signal to the audio signal to be played.
  • a transform domain audio watermarking algorithm can also be used to add a preset audio watermark signal to the audio signal to be played.
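  • Purely for illustration (the 18 kHz tone frequency and the amplitude are assumptions; the disclosure only says a single-frequency or multi-frequency high-band tone may be embedded), adding a near-ultrasonic single-tone watermark could look like:

```python
import numpy as np

def add_tone_watermark(audio: np.ndarray, sample_rate: int = 48000,
                       tone_hz: float = 18000.0, amplitude: float = 0.01) -> np.ndarray:
    """Embed a low-amplitude high-frequency tone into the playback signal as a
    simple audio watermark that is hard for listeners to perceive."""
    t = np.arange(len(audio)) / sample_rate
    watermark = amplitude * np.sin(2 * np.pi * tone_hz * t)
    return audio + watermark
```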
  • The receiving terminal that generates the howling can then receive both the audio signal to which the preset audio watermark signal has been added and the current audio signal. The receiving terminal that generates the howling detects the audio signal to which the preset audio watermark signal is added, obtains the result that there is a howling signal in the current audio signal, and then suppresses the current audio signal to obtain the first target audio signal and plays it, which avoids reducing the quality of the audio signal received by all receiving terminals.
  • the howling suppression method further includes the steps:
  • A first audio signal corresponding to a first time period is collected, audio watermark detection is performed based on the first audio signal, and it is determined that the first audio signal contains a target audio watermark signal.
  • The first audio signal refers to an audio signal that is played by a nearby terminal through its playback device and collected by a collection device such as a microphone; howling may occur between the terminal and the nearby terminal.
  • the audio watermark detection can be performed by using an audio watermark detection algorithm.
  • The audio watermark detection algorithm is used to detect the audio watermark signal added to the first audio signal. It can be an adjacent-band energy ratio algorithm, in which the ratios between the energies of adjacent subbands in the audio signal are calculated first, and the audio watermark signal is then extracted according to these ratios.
  • the target audio watermark signal refers to a preset audio watermark signal added to the first audio signal by a terminal with a relatively short distance.
  • the first time period refers to a time period corresponding to the first audio signal.
  • the terminal collects the first audio signal corresponding to the first time period through a collection device such as a microphone.
  • the first audio signal is divided into sub-bands, and the energy of each sub-band is calculated, and then the energy of adjacent sub-bands is compared to obtain the adjacent-band energy ratio.
  • When the adjacent-band energy ratio exceeds the preset adjacent-band energy ratio threshold, it is determined that the target high-frequency watermark signal is contained in the first audio signal.
  • This indicates that the audio signal received through the network contains a howling signal.
  • The preset adjacent-band energy ratio threshold is a preconfigured threshold used to detect whether the signal contains the preset high-frequency watermark signal.
  • the audio watermark signal added to the first audio signal can also be detected by a watermark extraction algorithm.
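  • For illustration only (the band edges and the ratio threshold are assumptions, matched to the assumed 18 kHz tone of the earlier sketch), detecting the watermark by comparing the energy of the watermark band with an adjacent band might look like:

```python
import numpy as np

def watermark_detected(frame: np.ndarray, sample_rate: int = 48000,
                       band_hz: tuple = (17500.0, 18500.0),
                       ratio_threshold: float = 5.0) -> bool:
    """Compare the energy in the watermark band against an adjacent band of the
    same width; a large ratio suggests the embedded tone watermark is present."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = spectrum[(freqs >= band_hz[0]) & (freqs < band_hz[1])].sum()
    width = band_hz[1] - band_hz[0]
    neighbour = spectrum[(freqs >= band_hz[0] - width) & (freqs < band_hz[0])].sum()
    return band / (neighbour + 1e-12) > ratio_threshold
```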
  • the second time period refers to a time period corresponding to the target network encoded audio signal.
  • the second time period follows the first time period.
  • the target network-encoded audio signal refers to the encoded current audio signal received through the network.
  • the target network audio signal refers to the current audio signal obtained after decoding the target network encoded audio signal.
  • the terminal receives the target network encoded audio signal corresponding to the second time period through the network, decodes the target network encoded audio signal, and obtains the target network audio signal.
  • The target network audio signal is used as the current audio signal based on the target audio watermark signal contained in the first audio signal.
  • the terminal uses the target network audio signal as the current audio signal according to the target audio watermark signal contained in the first audio signal.
  • In this embodiment, when the terminal is a terminal that receives voice, the preset audio watermark signal can be detected from the collected first audio signal. When it is detected that the preset audio watermark signal exists in the first audio signal, the target network audio signal received through the network is used as the current audio signal, and howling suppression is then performed on the current audio signal, which avoids affecting the quality of the audio signals received by all terminals. Whether to use the target network audio signal as the current audio signal is determined by detecting the preset audio watermark signal, which improves the accuracy of the obtained current audio signal.
  • In one embodiment, step 202 of acquiring the current audio signal corresponding to the current time period includes:
  • Step 602 Receive the current network encoded audio signal corresponding to the current time period, decode the network encoded audio signal, and obtain the current network audio signal.
  • the current time period refers to the time period of the current network-encoded audio signal received by the terminal through the network.
  • the current network-encoded audio signal refers to the encoded audio signal received through the network.
  • When the terminal is a terminal receiving voice, the terminal receives the current network encoded audio signal corresponding to the current time period through the network interface, and decodes the network encoded audio signal to obtain the current network audio signal.
  • Step 604 Perform voice endpoint detection on the current network audio signal to obtain a network voice detection result, and simultaneously perform howling detection on the current network audio signal to obtain a network howling detection result.
• the network voice detection result refers to a result obtained by performing voice endpoint detection on the current network audio signal, and may include that the current network audio signal contains a speech signal or that the current network audio signal does not contain a speech signal.
  • the network howling detection result refers to a result obtained by performing howling detection on the current network audio signal, and may include that the current network audio signal includes a howling signal and the current network audio signal does not include a howling signal.
  • voice endpoint detection is performed on the current network audio signal through a voice endpoint detection model to obtain a network voice detection result
  • howling detection is performed on the current network audio signal through a howling detection model to obtain a network howling detection result.
• the current network audio signal can be low-pass filtered to obtain a low-frequency signal, the signal energy corresponding to the low-frequency signal is calculated, the energy fluctuation is calculated based on the signal energy, and the network voice detection result corresponding to the current network audio signal is determined according to the energy fluctuation.
  • the current network audio signal can be low-pass filtered to obtain a low-frequency signal, the low-frequency signal is subjected to pitch detection to obtain a pitch period, and the network voice detection result corresponding to the current network audio signal is determined according to the pitch period.
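• The two voice endpoint detection heuristics just described (low-pass filtering followed by either an energy-fluctuation check or a pitch-period check) can be sketched roughly as below for the fluctuation variant. The filter order, cutoff frequency and fluctuation threshold are assumptions for illustration; a production detector would be tuned on data or replaced by a trained model.

```python
import numpy as np
from scipy.signal import butter, lfilter

def network_vad(frame, sample_rate=16000, cutoff_hz=1000.0, fluctuation_threshold=2.5):
    """Rough voice-activity decision for the current network audio signal.

    Low-pass filter the frame, measure short-term energies inside it, and call the
    frame 'speech' when those energies fluctuate strongly. All numeric values are
    illustrative assumptions.
    """
    b, a = butter(4, cutoff_hz / (sample_rate / 2.0), btype="low")
    low_freq = lfilter(b, a, frame)

    # Split the filtered frame into short sub-frames and measure energy variation.
    sub_frames = np.array_split(low_freq, 8)
    energies = np.array([np.sum(s.astype(np.float64) ** 2) + 1e-12 for s in sub_frames])
    fluctuation = energies.max() / energies.min()   # speech tends to fluctuate more than steady howling

    return fluctuation > fluctuation_threshold
```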
  • the current network audio feature corresponding to the current network audio signal may be extracted, and the historical network audio feature may be obtained, the similarity between the historical network audio feature and the current network audio feature may be calculated, and the network howling detection result may be determined based on the similarity.
  • Step 606 extract the network audio feature of the current network audio signal, obtain the second historical audio signal of the second historical time period, and extract the second historical audio feature corresponding to the second historical audio signal.
  • the network audio feature refers to the audio feature corresponding to the current network audio signal.
  • the second historical time period refers to the time period corresponding to the second historical audio signal, and there may be multiple second historical time periods.
  • the second historical audio signal refers to a historical audio signal collected by a collection device such as a microphone.
  • the second historical audio feature refers to the audio feature corresponding to the second historical audio signal.
  • the terminal extracts the network audio feature of the current network audio signal, obtains the second historical audio signal of the second historical time period stored in the memory, and extracts the second historical audio feature corresponding to the second historical audio signal.
  • Step 608 Calculate the network audio similarity between the network audio feature and the second historical audio feature, and determine that the network audio signal is the current audio signal corresponding to the current time period based on the network audio similarity and the network howling detection result.
  • the network audio similarity refers to the degree of similarity between the current network audio signal and the second historical audio signal. The higher the network audio similarity is, the closer the distance between the terminal and the terminal sending the current network audio signal is.
  • the terminal calculates the network audio similarity between the network audio feature and the second historical audio feature through a similarity algorithm.
• when the network audio similarity exceeds the preset network audio similarity threshold and the network howling detection result is that a howling signal exists in the current network audio signal, the current network audio signal is taken as the current audio signal corresponding to the current time period.
• the preset network audio similarity threshold is a threshold used to determine whether the terminal and the terminal sending the current network audio signal are close to each other.
• when the network audio similarity exceeds the preset network audio similarity threshold, it indicates that the terminal and the terminal sending the current network audio signal are close to each other, and howling is prone to occur.
• when the network audio similarity does not exceed the preset network audio similarity threshold, it means that the terminal is far away from the terminal sending the current network audio signal, and howling is unlikely to occur.
• the terminal may acquire multiple second historical audio signals, extract the second historical audio feature corresponding to each second historical audio signal, and calculate the network audio similarity between each second historical audio feature and the network audio feature respectively.
• when the network audio similarity exceeds the preset network audio similarity threshold and the duration for which it does so exceeds a preset duration threshold, it means that the terminal is close to the terminal that sends the current network audio signal.
• when the duration for which the network audio similarity exceeds the preset network audio similarity threshold does not exceed the preset duration threshold, it indicates that the terminal is far away from the terminal that sends the current network audio signal. Here, multiple refers to at least two.
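• A rough sketch of the proximity check described above: spectral features are extracted from the decoded network audio and from audio recently captured by the local microphone, their cosine similarity is computed, and the terminals are treated as close only when the similarity stays above a threshold for long enough. The feature choice, threshold and required duration are assumptions.

```python
import numpy as np

def spectral_feature(frame):
    """Log-magnitude spectrum used as the audio feature (an illustrative choice)."""
    magnitude = np.abs(np.fft.rfft(frame))
    return np.log(magnitude + 1e-9)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class ProximityDetector:
    """Declare the terminals 'close' when the similarity exceeds the threshold
    for at least `required_frames` consecutive frames (an assumed policy)."""

    def __init__(self, similarity_threshold=0.9, required_frames=10):
        self.similarity_threshold = similarity_threshold
        self.required_frames = required_frames
        self.run_length = 0

    def update(self, network_frame, mic_history_frame):
        similarity = cosine_similarity(spectral_feature(network_frame),
                                       spectral_feature(mic_history_frame))
        self.run_length = self.run_length + 1 if similarity > self.similarity_threshold else 0
        return self.run_length >= self.required_frames
```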
• the network audio signal is determined to be the current audio signal corresponding to the current time period based on the network audio similarity and the network howling detection result, so that the determined current audio signal is more accurate.
  • step 204 the frequency domain audio signal is divided to obtain each subband, and the target subband is determined from each subband, including:
  • the frequency domain audio signal is divided according to the preset number of subbands to obtain each subband.
  • the sub-band energy corresponding to each sub-band is calculated, and the energy of each sub-band is smoothed to obtain the smoothed energy of each sub-band.
  • the target subband is determined based on the smoothed energy of each subband.
  • the preset number of subbands is the preset number of subbands to be divided.
  • the terminal divides the frequency domain audio signal unevenly according to the preset number of subbands to obtain each subband.
• the terminal then calculates the subband energy corresponding to each subband, and the subband energy may be the volume or the logarithmic energy. In one embodiment, a triangular filter may be used to calculate the subband energy corresponding to each subband.
  • the energy of each subband can be calculated by 30 triangular filters.
  • the frequency range of each subband may not be equal, and there may be overlap in frequency between adjacent subbands.
  • the energy of each sub-band is smoothed, that is, the energy corresponding to the sub-band at the same position existing in the recent time period is obtained, and then the average value is calculated to obtain the smoothed sub-band energy of the sub-band.
  • the average subband energy is taken as the smoothed subband energy of the first subband in the current audio signal.
  • the smoothed sub-band energy corresponding to each sub-band is obtained by calculating in turn.
  • the smoothed sub-band energies are compared, and the sub-band with the largest sub-band energy is selected as the target sub-band, and the target sub-band contains the most howling energy.
  • the sub-band with the largest sub-band energy may be selected starting from the specified sub-band. For example, the current audio signal is divided into 30 subbands, and the subband corresponding to the maximum smoothed subband energy can be selected from the 6th to 30th subbands.
  • a preset number of subbands may be selected as target subbands in descending order according to the comparison result. For example, the top three subbands in the descending order of subband energy are selected as target subbands.
• in this way, the selected target sub-band is more accurate.
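• The sketch below illustrates the subband-energy computation and target-subband selection described above: a triangular filter bank (30 bands here, matching the example) measures per-band energy, the energies are averaged over recent frames, and the band with the largest smoothed energy is picked, optionally starting from a specified band index (the 6th subband in the example). The linear filter spacing and history length are assumptions.

```python
import numpy as np
from collections import deque

def triangular_filterbank(n_filters, n_fft_bins):
    """Build n_filters triangular filters over the magnitude spectrum (linear spacing assumed)."""
    edges = np.linspace(0, n_fft_bins - 1, n_filters + 2).astype(int)
    bank = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        bank[i, left:center + 1] = np.linspace(0.0, 1.0, center - left + 1)
        bank[i, center:right + 1] = np.linspace(1.0, 0.0, right - center + 1)
    return bank

class TargetSubbandSelector:
    def __init__(self, n_filters=30, history=8, first_candidate=5):
        self.history = [deque(maxlen=history) for _ in range(n_filters)]
        self.first_candidate = first_candidate      # index 5 = the 6th subband, as in the example
        self.n_filters = n_filters

    def select(self, power_spectrum, filterbank):
        energies = filterbank @ power_spectrum       # subband energies of the current frame
        smoothed = np.empty(self.n_filters)
        for k, energy in enumerate(energies):
            self.history[k].append(energy)
            smoothed[k] = np.mean(self.history[k])   # temporal smoothing over recent frames
        # pick the subband with the largest smoothed energy among the candidate subbands
        k_best = self.first_candidate + int(np.argmax(smoothed[self.first_candidate:]))
        return k_best, smoothed
```

• A caller would typically build the filter bank once, e.g. `bank = triangular_filterbank(30, len(power_spectrum))`, and feed each frame's power spectrum to `select`.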
  • determining the target subband based on the smoothed energy of each subband includes:
  • the howling subband refers to a subband including a howling signal.
  • the howling subband energy refers to the energy corresponding to the howling subband.
  • the target energy refers to the maximum howling subband energy.
  • the target whistling subband refers to the whistling subband corresponding to the maximum whistling subband energy.
• the terminal obtains the current howling detection result corresponding to the current audio signal. When the current howling detection result is that there is a howling signal in the current audio signal, the sub-bands corresponding to the howling signal are determined from each subband according to the frequency of the howling signal and the frequency of the voice signal, so as to obtain each howling sub-band.
• the energy corresponding to each howling subband is determined according to the energy of each subband. The energies of the howling sub-bands are then compared, the largest howling sub-band energy is selected as the target energy, and the target howling sub-band corresponding to the target energy is taken as the target sub-band.
• each howling subband may also be directly used as a target subband, that is, the subband gain coefficient corresponding to each howling subband is calculated, the product of the corresponding sub-band gain coefficient and the historical sub-band gain is calculated to obtain the current sub-band gain corresponding to each howling sub-band, and howling suppression is performed on each howling sub-band based on each current sub-band gain to obtain the first target audio signal.
  • each howling subband is determined from each subband according to the current howling detection result, and then the target subband is determined from each howling subband, which improves the accuracy of obtaining the target subband.
  • step 206 based on the current howling detection result and the current voice detection result, determine the subband gain coefficient corresponding to the current audio signal, including:
• Step 702 when the current voice detection result is that the current audio signal does not contain a voice signal and the current howling detection result is that the current audio signal contains a howling signal, obtain a preset decrement coefficient, and use the preset decrement coefficient as the subband gain coefficient corresponding to the current audio signal.
• the preset decreasing coefficient refers to a preset coefficient for decreasing the subband gain, and can be a value less than 1.
• when the terminal detects that the current audio signal does not contain a voice signal and there is a howling signal in the current audio signal, the terminal obtains a preset decrement coefficient, and uses the preset decrement coefficient as the subband gain coefficient corresponding to the current audio signal. That is, when it is detected that the current audio signal does not contain a voice signal and there is a howling signal in the current audio signal, the sub-band gain needs to be gradually decreased from the initial value until there is no howling signal in the current audio signal or the sub-band gain of the current audio signal has reached a predetermined lower limit value, for example, 0.08.
• Step 704 when the current voice detection result is that the current audio signal contains a voice signal and the current howling detection result is that the current audio signal contains a howling signal, obtain a preset first increment coefficient, and use the preset first increment coefficient as the subband gain coefficient corresponding to the current audio signal;
• Step 706 when the current howling detection result is that the current audio signal does not contain a howling signal, obtain a preset second increment coefficient, and use the preset second increment coefficient as the subband gain coefficient corresponding to the current audio signal, wherein the preset first increment coefficient is greater than the preset second increment coefficient.
  • the preset first increment coefficient refers to a preset coefficient that increases the subband gain when the current audio signal includes a speech signal and includes a howling signal.
  • the preset second increment coefficient refers to a preset coefficient that increases the subband gain when the current audio signal does not contain a howling signal. The preset first increment coefficient is greater than the preset second increment coefficient.
  • the preset first increment coefficient is used as the subband gain coefficient corresponding to the current audio signal.
  • the sub-band gain needs to be increased rapidly, so that the sub-band gain can be restored to the initial value.
  • the preset second increment coefficient is used as the sub-band gain coefficient corresponding to the current audio signal, and at this time, the sub-band gain of the current audio signal is restored according to the preset second increment coefficient to the initial value.
• the preset first increment coefficient is greater than the preset second increment coefficient, indicating that the sub-band gain returns to the initial value faster when the current audio signal contains both a voice signal and a howling signal than when the current audio signal does not contain a howling signal.
  • the current audio signal is acquired every 20ms, and the subband gain of the current audio signal is calculated. When the voice call starts, there is generally no howling signal, so the subband gain will remain unchanged.
• when a howling signal is detected in the current audio signal, the sub-band gain of the current audio signal is decremented from the initial value according to the preset decrement coefficient; then, when it is detected that there is a howling signal and the signal contains a voice signal, the sub-band gain of the current audio signal is calculated according to the preset first increment coefficient, that is, the sub-band gain of the current audio signal is rapidly increased so that it returns to the initial value.
• the sub-band gain coefficient is determined according to the current voice detection result and the howling detection result, so that the obtained sub-band gain coefficient is more accurate, thereby making the howling suppression more accurate and further improving the quality of the obtained first target audio signal.
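• Steps 702-706 amount to a small per-frame update: pick a coefficient from the (voice, howling) detection flags, multiply it into the previous subband gain, and clamp the result between the lower limit (0.08 in the example above) and the initial value. A minimal sketch follows; the concrete coefficient values and the initial value of 1.0 are assumptions, since the text only states that the decrement coefficient can be less than 1 and that the first increment coefficient is greater than the second.

```python
def update_subband_gain(prev_gain, has_voice, has_howling,
                        decrement=0.9, first_increment=1.2, second_increment=1.05,
                        lower_limit=0.08, initial_value=1.0):
    """Update one subband gain for the current frame (illustrative coefficient values)."""
    if has_howling and not has_voice:
        coefficient = decrement           # keep attenuating while pure howling persists
    elif has_howling and has_voice:
        coefficient = first_increment     # recover quickly to protect speech
    else:
        coefficient = second_increment    # no howling: drift back toward the initial value
    gain = prev_gain * coefficient
    return min(max(gain, lower_limit), initial_value)
```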
  • the howling suppression method further includes:
  • Step 802 Determine a target low frequency signal and a target high frequency signal from the current audio signal based on a preset low frequency range.
  • the preset low frequency range refers to the preset frequency range of the human voice, for example, less than 1400 Hz.
  • the target low frequency signal refers to an audio signal within a preset low frequency range in the current audio signal, and the target high frequency signal refers to an audio signal in the current audio signal that exceeds the preset low frequency range.
  • the terminal divides the current audio signal according to the preset low frequency range to obtain the target low frequency signal and the target high frequency signal.
  • the audio signal less than 1400Hz in the current audio signal is used as the target low frequency signal
  • the audio signal exceeding 1400Hz in the current audio signal is used as the target high frequency signal.
  • Step 804 Calculate the low frequency energy corresponding to the target low frequency signal, and smooth the low frequency energy to obtain the smoothed low frequency energy.
  • the low-frequency energy refers to the energy corresponding to the target low-frequency signal.
• the terminal directly calculates the low-frequency energy corresponding to the target low-frequency signal, or divides the target low-frequency signal to obtain each low-frequency sub-band, calculates the energy corresponding to each low-frequency sub-band, and then sums the energies of the low-frequency sub-bands to obtain the low-frequency energy corresponding to the target low-frequency signal.
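• A small sketch of Steps 802-804 up to this point: the frame's spectrum is split at the preset low-frequency boundary (1400 Hz in the example above) and the low-frequency energy is summed. The FFT-mask implementation and the 16 kHz sample rate are assumed choices; a filter bank would work equally well. The returned low-frequency energy would then be smoothed as described next.

```python
import numpy as np

def split_low_high_energy(frame, sample_rate=16000, low_cutoff_hz=1400.0):
    """Return (low_frequency_energy, high_band_power_spectrum) for one frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    low_mask = freqs < low_cutoff_hz            # target low-frequency signal (roughly the voice range)
    low_energy = float(np.sum(spectrum[low_mask]))
    high_band = spectrum[~low_mask]             # target high-frequency signal, later split into subbands
    return low_energy, high_band
```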
  • the low frequency energy is smoothed to obtain the smoothed low frequency energy.
  • the following formula (1) can be used for smoothing.
  • E v (t) refers to the smoothed low-frequency energy corresponding to the target low-frequency signal in the current audio signal corresponding to the current time period.
  • E v (t-1) refers to the historical low frequency energy corresponding to the historical low frequency signal in the historical audio signal corresponding to the previous historical time period.
  • E c refers to the low frequency energy corresponding to the target low frequency signal in the current audio signal corresponding to the current time period.
• a refers to the smoothing coefficient, which is preset. The value of a when E c is greater than E v (t-1) can be different from its value when E c is less than E v (t-1), so as to better track the rising and falling stages of the energy.
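• Formula (1) itself is not reproduced in this text; based on the symbol definitions above, it is presumably a first-order recursive smoothing of the form

E_v(t) = a · E_v(t-1) + (1 - a) · E_c

where E_v(t), E_v(t-1) and E_c are the quantities defined above. The assignment of a to the historical term rather than the current term is an assumption; as stated above, a may take one value when E_c rises above E_v(t-1) and another when it falls below it, so that the rising and falling stages of the energy are tracked at different speeds.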
  • Step 806 Divide the target high frequency signal to obtain each high frequency subband, and calculate the high frequency subband energy corresponding to each high frequency subband.
  • the terminal may divide the target high frequency signal to obtain each high frequency subband, and use a triangular filter to calculate the high frequency subband energy corresponding to each high frequency subband.
  • Step 808 Obtain the preset energy upper limit weight corresponding to each high frequency sub-band, and calculate the high frequency sub-band upper limit energy corresponding to each high frequency sub-band based on the preset energy upper limit weight corresponding to each high frequency sub-band and the smoothed low frequency energy .
• the preset energy upper limit weight refers to a preset weight on the energy upper limit of a high frequency sub-band. Different high frequency sub-bands have different preset energy upper limit weights, and the preset energy upper limit weights of the high frequency sub-bands can decrease in turn in the order of frequency from low to high.
  • the upper limit energy of the high frequency subband refers to the upper limit of the energy of the high frequency subband, and the energy of the high frequency subband cannot exceed the upper limit.
• the terminal obtains the preset energy upper limit weight corresponding to each high frequency subband, and calculates the product of the preset energy upper limit weight corresponding to the high frequency subband and the smoothed low frequency energy to obtain the high frequency sub-band upper limit energy corresponding to each high frequency subband.
  • the high frequency subband upper limit energy can be calculated using Equation (2).
• k refers to the kth high frequency subband, and k is a positive integer.
• E u (k) is the upper limit energy of the high frequency subband corresponding to the kth high frequency subband.
• E v (t) refers to the smoothed low-frequency energy corresponding to the target low-frequency signal, as defined above.
• b(k) refers to the preset energy upper limit weight corresponding to the kth high-frequency sub-band; for example, the preset energy upper limit weights of the high-frequency sub-bands can be (0.8, 0.7, 0.6, ...) in order.
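• Equation (2) is likewise not reproduced here; from the description of Step 808 (the product of the preset energy upper limit weight and the smoothed low-frequency energy), it is presumably

E_u(k) = b(k) · E_v(t)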
  • Step 810 Calculate the ratio of the upper limit energy of the high frequency subband to the energy of the high frequency subband to obtain the upper limit gain of each high frequency subband.
• the high-frequency sub-band upper limit gain refers to the upper limit on the gain applied to a high-frequency sub-band, that is, the gain applied to a high-frequency sub-band cannot exceed its high-frequency sub-band upper limit gain.
  • the terminal calculates the ratio of the upper limit energy of each high frequency subband to the corresponding high frequency subband energy, respectively, to obtain the upper limit gain of each high frequency subband.
  • Equation (3) can be used to calculate the high frequency subband cap gain.
  • E(k) refers to the high frequency subband energy corresponding to the kth high frequency subband.
  • E u (k) refers to the upper limit energy of the high frequency subband corresponding to the kth high frequency subband.
  • M(k) refers to the upper limit gain of the high frequency subband corresponding to the kth high frequency subband.
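• Equation (3) is not reproduced in this text; from the description of Step 810 (the ratio of the upper limit energy of the high frequency subband to the high frequency subband energy), it is presumably

M(k) = E_u(k) / E(k)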
• Step 812 calculate each high frequency subband gain corresponding to each high frequency subband, determine each high frequency subband target gain based on each high frequency subband upper limit gain and each high frequency subband gain, and perform howling suppression on each high frequency subband based on each high frequency subband target gain to obtain a second target audio signal corresponding to the current time period.
  • the high-frequency sub-band gain is calculated according to the high-frequency sub-band gain coefficient and the historical high-frequency sub-band gain.
  • the high frequency sub-band gain coefficient is determined according to the current howling detection result and the current voice detection result.
  • the historical high frequency subband gain refers to the gain of the high frequency subband corresponding to the historical audio signal of the historical time period.
  • the high frequency subband target gain is the gain used for howling suppression.
  • the second target audio signal refers to an audio signal obtained after all high-frequency subbands are subjected to howling suppression.
• the terminal obtains the historical high-frequency sub-band gain corresponding to each historical high-frequency sub-band, determines each high-frequency sub-band gain coefficient according to the current howling detection result and the current voice detection result, and calculates the product of each historical high-frequency sub-band gain and the corresponding high-frequency sub-band gain coefficient to obtain each high-frequency sub-band gain corresponding to each high-frequency sub-band.
• B(k) refers to the high-frequency sub-band target gain corresponding to the k-th high-frequency sub-band.
• G(k) refers to the high-frequency sub-band gain corresponding to the k-th high-frequency sub-band.
• M(k) refers to the upper limit gain of the high frequency subband corresponding to the kth high frequency subband.
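• Putting Steps 808-812 together, a minimal sketch of the high-frequency energy constraint is shown below. It assumes the target gain B(k) is the smaller of the subband gain G(k) and the upper-limit gain M(k); that min() rule, and the decreasing example weights, are assumptions consistent with the curve-B constraint described below for FIG. 8a rather than values stated in this application.

```python
import numpy as np

def constrain_high_band(hf_subband_energy, hf_subband_gain, smoothed_low_energy,
                        upper_limit_weights=None):
    """Apply the high-frequency energy constraint.

    hf_subband_energy:   E(k), energy of each high-frequency subband.
    hf_subband_gain:     G(k), gain of each high-frequency subband for this frame.
    smoothed_low_energy: E_v(t), smoothed energy of the target low-frequency signal.
    upper_limit_weights: b(k); if omitted, an illustrative decreasing sequence is assumed.
    """
    E = np.asarray(hf_subband_energy, dtype=np.float64) + 1e-12
    G = np.asarray(hf_subband_gain, dtype=np.float64)
    if upper_limit_weights is None:
        # illustrative decreasing weights, e.g. 0.8, 0.7, 0.6, ... (assumed values)
        b = np.clip(np.linspace(0.8, 0.8 - 0.1 * (len(E) - 1), num=len(E)), 0.1, None)
    else:
        b = np.asarray(upper_limit_weights, dtype=np.float64)

    E_u = b * smoothed_low_energy   # eq. (2): upper-limit energy per high-frequency subband
    M = E_u / E                     # eq. (3): upper-limit gain per high-frequency subband
    B = np.minimum(G, M)            # assumed rule: the target gain never exceeds the upper-limit gain
    return B
```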
• FIG. 8a is a schematic diagram of energy constraint curves.
  • the abscissa represents the frequency
  • the ordinate represents the energy.
• Different subbands are obtained by frequency division, and the figure shows 9 sub-bands. The sub-bands whose frequencies are lower than 1400 Hz form the low band, and the sub-bands higher than 1400 Hz form the high band; the low band consists of the 1st to the 4th sub-bands, and the high band consists of the 5th to the 9th sub-bands.
  • the curve C is the energy curve when there is only the voice signal in the audio signal.
  • Curve B refers to the energy constraint curve for high frequency signals.
• Curve A refers to the energy curve when the audio signal includes a voice signal and a howling signal. It can be clearly seen that in the low frequency band, that is, for the speech signal in the 1st to the 4th sub-bands, no energy constraint is performed; in the high frequency band, that is, after the 4th sub-band, when a howling signal is included, the energy of the audio signal needs to be constrained to be below curve B to obtain the audio signal after howling suppression.
  • the high-frequency sub-band energy of the high-frequency sub-band is constrained by using the high-frequency sub-band upper limit gain, so as to ensure the quality of the obtained second target audio signal.
  • the howling suppression method includes the following steps:
  • Step 902 collect the initial audio signal corresponding to the current time period through the microphone, perform echo cancellation on the initial audio signal, and obtain the initial audio signal after echo cancellation.
  • Step 904 Input the initial audio signal after echo cancellation into the voice endpoint detection model for detection, and obtain the current voice detection result. Noise suppression is performed on the echo-cancelled initial audio signal based on the current speech detection result to obtain the noise-suppressed initial audio signal.
• Step 906 Extract the initial audio feature corresponding to the initial audio signal after noise suppression, obtain the first historical audio signal corresponding to the first historical time period, extract the first historical audio feature corresponding to the first historical audio signal, calculate the first similarity between the initial audio feature and the first historical audio feature, and determine the current howling detection result based on the first similarity.
  • Step 908 when the current howling detection result is that there is a howling signal in the initial audio signal after noise suppression, perform frequency domain transformation on the current audio signal to obtain a frequency domain audio signal;
  • Step 910 Divide the frequency domain audio signal according to the preset number of sub-bands, obtain each sub-band, calculate the sub-band energy corresponding to each sub-band, and smooth the energy of each sub-band to obtain the smoothed energy of each sub-band , and the target subband is determined based on the smoothed energy of each subband.
• Step 912 when the current voice detection result is that the current audio signal does not contain a voice signal and the current howling detection result is that the current audio signal contains a howling signal, obtain a preset decrement coefficient, and use the preset decrement coefficient as the subband gain coefficient corresponding to the current audio signal.
  • Step 914 Obtain the historical subband gain corresponding to the audio signal of the historical time period, and calculate the current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain.
• Step 916 perform howling suppression on the target subband based on the current subband gain to obtain the first target audio signal corresponding to the current time period, and send the first target audio signal corresponding to the current time period through the network to the terminal that receives the first target audio signal.
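• For orientation only, the sketch below strings Steps 902-916 together at the level of function calls. The per-step operations are passed in as callables because none of them are defined here; every name on `ops` is a hypothetical placeholder mirroring the order of operations described above, not an API of this application.

```python
def process_uplink_frame(frame, state, ops, lower_limit=0.08, initial_gain=1.0):
    """One uplink iteration of Steps 902-916.

    `ops` is any object providing the hypothetical callables used below
    (echo_cancel, vad, suppress_noise, detect_howling, to_frequency_domain,
    select_target_subband, gain_coefficient, apply_subband_gain,
    to_time_domain, encode_and_send). `state` is a dict holding the running
    subband gain ("gain") and recent frames ("history").
    """
    frame = ops.echo_cancel(frame)                                   # Step 902
    has_voice = ops.vad(frame)                                       # Step 904
    frame = ops.suppress_noise(frame, has_voice)                     # Step 904
    has_howling = ops.detect_howling(frame, state["history"])        # Step 906

    if has_howling:
        spectrum = ops.to_frequency_domain(frame)                    # Step 908
        target_k = ops.select_target_subband(spectrum)               # Step 910
        coefficient = ops.gain_coefficient(has_voice, has_howling)   # Step 912
        state["gain"] = min(max(state["gain"] * coefficient, lower_limit), initial_gain)  # Step 914
        spectrum = ops.apply_subband_gain(spectrum, target_k, state["gain"])              # Step 916
        frame = ops.to_time_domain(spectrum)

    ops.encode_and_send(frame)                                       # Step 916
    return frame
```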
• the present application also provides an application scenario where the above-mentioned howling suppression method is applied. In a feasible implementation, the application of the howling suppression method in this application scenario is as follows:
• FIG. 10 is a specific application scenario diagram of the howling suppression method when a voice conference is held in the Enterprise WeChat application. As shown in FIG. 10, the terminal 1002 and the terminal 1004 are in the same room and conduct VoIP (Voice over Internet Protocol) calls with other terminals. At this time, the voice collected by the microphone of the terminal 1002 will be sent to the terminal 1004 through the network, and after being played through the speaker of the terminal 1004, it will be collected again by the microphone of the terminal 1002. An acoustic loop is therefore formed, and as the cycle repeats, it produces the acoustic effect of "howling".
• FIG. 11 provides a schematic diagram of the architecture of a howling suppression method. As shown in FIG. 11, the terminal processes the audio signal collected by the microphone through uplink audio processing, and then encodes it and sends it through the network.
  • the audio signal obtained from the network interface is processed through downlink audio processing, and then the audio is played.
  • the sound collected by the terminal 1002 through the microphone will be encoded and sent to the network side after uplink audio processing to form a network signal.
• Uplink audio processing includes performing echo cancellation on the audio signal, and performing voice endpoint detection on the echo-cancelled audio signal, that is, voice analysis to distinguish non-voice signals from voice signals. Noise suppression is performed on non-voice signals to obtain the audio signal after noise suppression. Then, howling detection is performed on the audio signal after noise suppression to obtain the howling detection result. Howling suppression is performed according to the howling detection result and the voice endpoint detection result to obtain a howling-suppressed voice signal, the volume of the howling-suppressed voice signal is controlled, and the signal is then encoded and sent.
• the terminal 1002 performs signal analysis on the audio signal that needs howling suppression, that is, transforms it from the time domain into the frequency domain to obtain the frequency-domain transformed audio signal, and then calculates the energy of each subband from the frequency-domain transformed audio signal according to the preset number of subbands and the frequency range of each subband. Then, the energy of each sub-band is smoothed in time to obtain the smoothed energy of each sub-band, and the sub-band with the largest smoothed sub-band energy is selected as the target sub-band.
• howlFlag represents the howling detection result, and VAD represents the voice endpoint detection result.
• when howlFlag is 1 and VAD indicates that the audio signal does not contain a voice signal, the preset decreasing coefficient is obtained as the subband gain coefficient.
• when howlFlag is 1 and VAD is 1, that is, the audio signal contains a voice signal, the preset first increment coefficient is obtained as the subband gain coefficient.
• when howlFlag indicates that the audio signal does not contain a howling signal, the preset second increment coefficient is obtained as the subband gain coefficient.
• obtain the historical sub-band gain used in the howling suppression of the previous audio signal, calculate the product of the historical sub-band gain and the sub-band gain coefficient to obtain the current sub-band gain, and use the current sub-band gain to perform howling suppression on the target sub-band to obtain the audio signal after howling suppression, which is then sent through the network side.
• the target low-frequency signal and the target high-frequency signal can also be determined from the current audio signal based on the preset low-frequency range. The low-frequency energy corresponding to the target low-frequency signal is calculated and smoothed to obtain the smoothed low-frequency energy. The target high-frequency signal is divided to obtain each high-frequency sub-band, and the high-frequency sub-band energy corresponding to each high-frequency sub-band is calculated. The preset energy upper limit weight corresponding to each high-frequency sub-band is obtained, and the high-frequency sub-band upper limit energy corresponding to each high-frequency sub-band is calculated based on the preset energy upper limit weight and the smoothed low-frequency energy. The ratio of the high-frequency sub-band upper limit energy to the high-frequency sub-band energy is calculated to obtain each high-frequency sub-band upper limit gain. Each high-frequency sub-band gain corresponding to each high-frequency sub-band is calculated, each high-frequency sub-band target gain is determined based on each high-frequency sub-band upper limit gain and each high-frequency sub-band gain, and howling suppression is performed on each high-frequency sub-band based on each high-frequency sub-band target gain to obtain a second target audio signal corresponding to the current time period, which is sent through the network side.
• when the terminal 1004 receives the network signal through the network interface, it decodes the signal to obtain the audio signal, then performs downlink audio processing and plays the audio.
  • the downlink audio processing may be volume control and so on.
  • the uplink audio processing in the terminal 1004 can also use the same method to process the audio and then send it through the network side.
• FIG. 13 provides a schematic diagram of the architecture of another howling suppression method. As shown in FIG. 13, specifically:
• when the terminal 1002 sends an audio signal to each terminal, howling may be caused because the terminal 1002 is close to the terminal 1004.
• the other terminals include terminal 1008, terminal 1010 and terminal 1012. If a terminal is far from the terminal 1002, howling will not be generated there. At this time, howling suppression can be performed in the terminal that receives the audio signal. Specifically:
• when the terminal 1004 receives, through the network interface, the network signal sent by the terminal 1002, decoding is performed to obtain an audio signal.
  • the audio signal is generally a signal that has undergone echo cancellation and noise suppression at the sending terminal.
  • the terminal 1004 directly performs howling detection and voice endpoint detection on the audio signal, and obtains the howling detection result and the voice endpoint detection result.
• the terminal 1004 collects historical audio signals of the same time length through a microphone and performs local detection. The local detection is used to detect whether the terminal 1004 and the terminal 1002 are close to each other. Specifically, the audio features of the audio signals collected by the microphone over the same length of time are extracted, the audio features of the audio signals received through the network side are extracted, and the similarity between them is then calculated.
• when the similarity exceeds the preset similarity threshold for a period of time, it means that the terminal 1004 and the terminal 1002 are close to each other, and the local detection result is that the terminal 1004 and the terminal 1002 are close to each other, indicating that the terminal 1004 is a terminal on the audio loop that causes the howling.
• howling suppression is performed according to the local detection result, the howling detection result and the voice endpoint detection result, that is, the process shown in FIG. 12 is executed to suppress the howling and obtain a howling-suppressed audio signal, and the terminal 1004 then plays the howling-suppressed audio signal.
• when the howling detection result is that the possibility of a howling signal in the audio signal exceeds a preset local-detection pause threshold, the local detection operation is suspended, and howling suppression is performed only according to the howling detection result and the voice endpoint detection result, thereby saving terminal resources.
• the downlink audio processing method of the terminal 1004, that is, the above-mentioned process of performing howling processing on the audio information, can also be applied to the downlink audio processing in other terminals, for example, in the terminal 1002.
• FIG. 14 also provides a schematic diagram of the architecture of yet another howling suppression method. Specifically:
• the terminal 1002 collects the current audio signal through the microphone, performs echo cancellation and noise suppression on the current audio signal, and then performs howling detection to obtain the current howling detection result.
• when the current howling detection result is that there is a howling signal in the current audio signal, the audio signal to be played and the preset audio watermark signal are obtained, the preset audio watermark signal is added to the audio signal to be played, and the result is played through the speaker; at the same time, the current audio signal is volume-controlled, encoded into a network signal, and sent to the terminal 1004 through a network interface.
• the terminal 1004 collects, through its microphone, the audio signal played by the terminal 1002 through the speaker, and then performs watermark detection, that is, calculates the adjacent-band energy ratio of the collected audio signal. When the adjacent-band energy ratio exceeds the preset adjacent-band energy ratio threshold, it is determined that the collected audio signal contains the preset audio watermark signal.
• the terminal 1004 obtains the network signal sent by the terminal 1002, decodes it to obtain an audio signal, and performs howling suppression on the audio signal, that is, executes the process shown in FIG. 12. The audio signal after howling suppression is then played through the speaker.
  • the sending terminal adds an audio watermark signal to the played audio signal.
• since the distance between the terminals that generate the howling is relatively close, the receiving terminal will collect, through its microphone, the audio signal with the added audio watermark signal, and perform watermark detection on the collected audio signal. Performing howling suppression when the watermark is detected improves the accuracy and efficiency of howling suppression.
  • the terminal 1004 can also add an audio watermark signal when sending the audio signal through the network side, and the terminal 1002 can also perform watermark detection to determine whether to perform howling suppression on the received audio signal.
• although the steps in FIGS. 2, 3-8 and 9 are shown in sequence according to the arrows, these steps are not necessarily executed sequentially in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to this order, and these steps may be performed in other orders. Moreover, at least a part of the steps in FIG. 2, FIGS. 3-8 and FIG. 9 may include multiple sub-steps or multiple stages, and these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times; the order of execution of these sub-steps or stages is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages in other steps.
  • a howling suppression apparatus 1500 is provided.
  • the apparatus may adopt a software module or a hardware module, or a combination of the two to become a part of a computer device.
• the apparatus specifically includes: a signal transformation module 1502, a subband determination module 1504, a coefficient determination module 1506, a gain determination module 1508 and a howling suppression module 1510, wherein:
  • the signal transformation module 1502 is used to obtain the current audio signal corresponding to the current time period, and perform frequency domain transformation on the current audio signal to obtain the frequency domain audio signal;
  • the subband determination module 1504 is used to divide the frequency domain audio signal, obtain each subband, and determine the target subband from each subband;
  • the coefficient determination module 1506 is used to obtain the current howling detection result corresponding to the current audio signal and the current voice detection result, and determines the subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
  • the gain determination module 1508 is used to obtain the historical subband gain corresponding to the audio signal of the historical time period, and calculate the current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain;
  • the howling suppression module 1510 is configured to perform howling suppression on the target subband based on the current subband gain to obtain the first target audio signal corresponding to the current time period.
  • the signal transformation module 1502 includes:
  • an echo cancellation unit used for collecting the initial audio signal corresponding to the current time period, performing echo cancellation on the initial audio signal, and obtaining the initial audio signal after echo cancellation
  • a voice detection unit for performing voice endpoint detection on the initial audio signal after echo cancellation to obtain a current voice detection result
  • a noise suppression unit configured to perform noise suppression on the initial audio signal after echo cancellation based on the current speech detection result, to obtain the initial audio signal after noise suppression
  • the howling detection unit is used to perform howling detection on the initial audio signal after noise suppression, and obtain the current howling detection result
  • the current audio signal determining unit is configured to use the initial audio signal after noise suppression as the current audio signal corresponding to the current time period when the current howling detection result is that there is a howling signal in the initial audio signal after noise suppression.
  • the voice detection unit is further configured to input the echo-cancelled initial audio signal into a voice endpoint detection model for detection to obtain a current voice detection result.
• the voice endpoint detection model is obtained by training a neural network algorithm based on training audio signals and corresponding training voice detection results.
• the voice detection unit is further configured to perform low-pass filtering on the initial audio signal after echo cancellation to obtain a low-frequency signal; calculate the signal energy corresponding to the low-frequency signal, calculate the energy fluctuation based on the signal energy, and determine the current voice detection result according to the energy fluctuation.
  • the voice detection unit is further configured to perform low-pass filtering on the initial audio signal after echo cancellation to obtain a low-frequency signal; perform fundamental tone detection on the low-frequency signal to obtain a fundamental tone period, and determine the current voice detection result according to the fundamental tone period.
  • the howling detection unit is further configured to input the noise-suppressed initial audio signal into the howling detection model for detection, and obtain the current howling detection result.
• the howling detection model is obtained by training a neural network algorithm based on howling training audio signals and corresponding training howling detection results.
  • the howling detection unit is further configured to extract the initial audio feature corresponding to the initial audio signal after noise suppression; acquire the first historical audio signal corresponding to the first historical time period, and extract the corresponding first historical audio signal The first historical audio feature; calculate the first similarity between the initial audio feature and the first historical audio feature, and determine the current howling detection result based on the first similarity.
• the howling suppression apparatus further includes:
• the watermark adding module is used to obtain the audio signal to be played and the preset audio watermark signal when the current howling detection result is that there is a howling signal in the current audio signal, add the preset audio watermark signal to the audio signal to be played, and play it.
• the howling suppression apparatus further includes:
  • a watermark detection module configured to collect a first audio signal corresponding to a first time period, perform audio watermark detection based on the first audio signal, and determine that the first audio signal contains a target audio watermark signal;
  • the signal obtaining module is used for receiving the target network coding audio signal corresponding to the second time period, decoding the target network coding audio signal, and obtaining the target network audio signal;
  • the current audio signal determining module is configured to use the target network audio signal as the current audio signal based on the target audio watermark signal contained in the first audio signal.
  • the signal transformation module 1502 includes:
  • the network signal obtaining module is used to receive the current network encoded audio signal corresponding to the current time period, decode the network encoded audio signal, and obtain the current network audio signal;
  • the network signal detection module is used to perform voice endpoint detection on the current network audio signal to obtain the network voice detection result, and perform howling detection on the current network audio signal to obtain the network howling detection result;
  • a feature extraction module for extracting the network audio feature of the current network audio signal, and obtaining the second historical audio signal of the second historical time period, and extracting the second historical audio feature corresponding to the second historical audio signal;
  • the current audio signal obtaining module is used to calculate the network audio similarity between the network audio feature and the second historical audio feature, and determine that the network audio signal is the current audio signal corresponding to the current time period based on the network audio similarity and the network howling detection result.
  • the subband determination module 1504 is further configured to divide the frequency domain audio signal according to the preset number of subbands to obtain each subband; calculate the subband energy corresponding to each subband, and calculate the subband energy corresponding to each subband. Perform smoothing to obtain the smoothed energy of each sub-band; determine the target sub-band based on the smoothed energy of each sub-band.
• the sub-band determination module 1504 is further configured to obtain the current howling detection result corresponding to the current audio signal, determine each howling sub-band from each sub-band according to the current howling detection result, and obtain each howling sub-band energy; select the target energy from each howling sub-band energy, and use the target howling sub-band corresponding to the target energy as the target sub-band.
• the coefficient determination module 1506 is further configured to: when the current voice detection result is that the current audio signal does not contain a voice signal and the current howling detection result is that the current audio signal contains a howling signal, obtain a preset decreasing coefficient, and use the preset decreasing coefficient as the sub-band gain coefficient corresponding to the current audio signal; when the current voice detection result is that the current audio signal contains a voice signal and the current howling detection result is that the current audio signal contains a howling signal, obtain the preset first increment coefficient, and use the preset first increment coefficient as the subband gain coefficient corresponding to the current audio signal; and when the current howling detection result is that the current audio signal does not contain a howling signal, obtain the preset second increment coefficient, and use the preset second increment coefficient as the subband gain coefficient corresponding to the current audio signal, wherein the preset first increment coefficient is greater than the preset second increment coefficient.
• the howling suppression apparatus further includes:
  • a signal dividing module used for determining the target low frequency signal and the target high frequency signal from the current audio signal based on the preset low frequency range
  • the low-frequency energy calculation module is used to calculate the low-frequency energy corresponding to the target low-frequency signal, and smooth the low-frequency energy to obtain the smoothed low-frequency energy;
  • the high-frequency energy calculation module is used to divide the target high-frequency signal, obtain each high-frequency sub-band, and calculate the high-frequency sub-band energy corresponding to each high-frequency sub-band;
  • the upper limit energy calculation module is used to obtain the preset energy upper limit weight corresponding to each high frequency subband, and calculate the high frequency corresponding to each high frequency subband based on the preset energy upper limit weight corresponding to each high frequency subband and the smoothed low frequency energy subband upper limit energy;
  • the upper limit gain determination module is used to calculate the ratio of the upper limit energy of the high frequency subband to the energy of the high frequency subband, and obtain the upper limit gain of each high frequency subband;
• the target audio signal obtaining module is used to calculate each high frequency subband gain corresponding to each high frequency subband, determine each high frequency subband target gain based on each high frequency subband upper limit gain and each high frequency subband gain, and perform howling suppression on each high frequency subband based on each high frequency subband target gain to obtain a second target audio signal corresponding to the current time period.
  • Each module in the above-mentioned howling suppression device can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 16 .
  • the computer equipment includes a processor, memory, a communication interface, a display screen, and an input device connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
  • the computer readable instructions when executed by a processor, implement a howling suppression method.
  • the display screen of the computer equipment may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment may be a touch layer covered on the display screen, or a button, a trackball or a touchpad set on the shell of the computer equipment , or an external keyboard, trackpad, or mouse.
• FIG. 16 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps in the foregoing method embodiments are implemented.
• a non-volatile computer-readable storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the howling suppression method in any of the above embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephone Function (AREA)

Abstract

The present application relates to a howling suppression method and apparatus, a computer device and a storage medium. The method includes: acquiring a current audio signal corresponding to a current time period, and performing frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal; dividing the frequency-domain audio signal to obtain each subband, and determining a target subband from each subband; acquiring a current howling detection result and a current voice detection result corresponding to the current audio signal, and determining a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result; acquiring a historical subband gain corresponding to an audio signal of a historical time period, and calculating a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and performing howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.

Description

Howling suppression method and apparatus, computer device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on September 30, 2020, with application number 2020110622548 and entitled "Howling suppression method and apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a howling suppression method and apparatus, a computer device and a storage medium.
Background
With the development of Internet communication technology, voice calls can be made over a network, for example, voice calls in various instant messaging applications. However, during a voice call, and especially during a voice conference, two or more voice call devices are often located close to each other, for example in the same room; in this case howling is very likely to occur, which affects the quality of the voice call. At present, howling is usually avoided by adjusting the distance between the voice call devices; however, when the distance cannot be adjusted, howling will be produced, and the quality of the voice call will be reduced.
Summary
According to various embodiments provided in this application, a howling suppression method and apparatus, a computer device and a storage medium are provided.
A howling suppression method, executed by a computer device, the method including:
acquiring a current audio signal corresponding to a current time period, and performing frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal;
dividing the frequency-domain audio signal to obtain each subband, and determining a target subband from each subband;
acquiring a current howling detection result and a current voice detection result corresponding to the current audio signal, and determining a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
acquiring a historical subband gain corresponding to an audio signal of a historical time period, and calculating a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and
performing howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.
A howling suppression apparatus, the apparatus including:
a signal transformation module, configured to acquire a current audio signal corresponding to a current time period, and perform frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal;
a subband determination module, configured to divide the frequency-domain audio signal to obtain each subband, and determine a target subband from each subband;
a coefficient determination module, configured to acquire a current howling detection result and a current voice detection result corresponding to the current audio signal, and determine a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
a gain determination module, configured to acquire a historical subband gain corresponding to an audio signal of a historical time period, and calculate a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and
a howling suppression module, configured to perform howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.
A computer device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to implement the following steps:
acquiring a current audio signal corresponding to a current time period, and performing frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal;
dividing the frequency-domain audio signal to obtain each subband, and determining a target subband from each subband;
acquiring a current howling detection result and a current voice detection result corresponding to the current audio signal, and determining a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
acquiring a historical subband gain corresponding to an audio signal of a historical time period, and calculating a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and
performing howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.
One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the following steps:
acquiring a current audio signal corresponding to a current time period, and performing frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal;
dividing the frequency-domain audio signal to obtain each subband, and determining a target subband from each subband;
acquiring a current howling detection result and a current voice detection result corresponding to the current audio signal, and determining a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result;
acquiring a historical subband gain corresponding to an audio signal of a historical time period, and calculating a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; and
performing howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period.
The details of one or more embodiments of this application are set forth in the drawings and the description below. Other features, objects and advantages of this application will become apparent from the specification, the drawings and the claims.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and a person skilled in the art may obtain other drawings based on these drawings without creative effort.
FIG. 1 is a diagram of an application environment of a howling suppression method in one embodiment;
FIG. 2 is a schematic flowchart of a howling suppression method in one embodiment;
FIG. 2a is a schematic diagram of the relationship between frequency and energy of an audio signal in a specific embodiment;
FIG. 3 is a schematic flowchart of obtaining a current audio signal in one embodiment;
FIG. 4 is a schematic flowchart of howling detection in one embodiment;
FIG. 5 is a schematic flowchart of obtaining a current audio signal in another embodiment;
FIG. 6 is a schematic flowchart of obtaining a current audio signal in yet another embodiment;
FIG. 7 is a schematic flowchart of obtaining a subband gain coefficient in one embodiment;
FIG. 8 is a schematic flowchart of obtaining a second target audio signal in one embodiment;
FIG. 8a is a schematic diagram of energy constraint curves in a specific embodiment;
FIG. 9 is a schematic flowchart of a howling suppression method in a specific embodiment;
FIG. 10 is a schematic diagram of an application scenario of a howling suppression method in a specific embodiment;
FIG. 11 is a schematic diagram of an application framework of a howling suppression method in a specific embodiment;
FIG. 12 is a schematic flowchart of a howling suppression method in a specific embodiment;
FIG. 13 is a schematic diagram of an application framework of a howling suppression method in another specific embodiment;
FIG. 14 is a schematic diagram of an application framework of a howling suppression method in yet another specific embodiment;
FIG. 15 is a structural block diagram of a howling suppression apparatus in one embodiment;
FIG. 16 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objectives, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit this application.
The howling suppression method provided in the embodiments of this application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with the server 106 through a network, the terminal 104 communicates with the server 106 through a network, and the terminal 102 and the terminal 104 conduct a voice call through the server 106; the terminal 102 and the terminal 104 are close to each other, for example, in the same room. Each of the terminal 102 and the terminal 104 can be a sending terminal that sends voice or a receiving terminal that receives voice. The terminal 102 or the terminal 104 acquires a current audio signal corresponding to a current time period and performs frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal; the terminal 102 or the terminal 104 divides the frequency-domain audio signal to obtain each subband and determines a target subband from each subband; the terminal 102 or the terminal 104 acquires a current howling detection result and a current voice detection result corresponding to the current audio signal, and determines a subband gain coefficient corresponding to the current audio signal based on the current howling detection result and the current voice detection result; the terminal 102 or the terminal 104 acquires a historical subband gain corresponding to an audio signal of a historical time period, and calculates a current subband gain corresponding to the current audio signal based on the subband gain coefficient and the historical subband gain; the terminal 102 or the terminal 104 performs howling suppression on the target subband based on the current subband gain to obtain a first target audio signal corresponding to the current time period. The terminal may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers and portable wearable devices, and the server may be implemented by an independent server or a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a howling suppression method is provided. The method is described by taking its application to the terminal in FIG. 1 as an example; it can be understood that the method can also be applied to a server, and can also be applied to a system including a terminal and a server and be implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step 202, acquire a current audio signal corresponding to a current time period, and perform frequency-domain transformation on the current audio signal to obtain a frequency-domain audio signal.
An audio signal is a carrier of the frequency and amplitude variation information of sound waves carrying speech, music, sound effects and the like. The current audio signal refers to an audio signal on which howling suppression needs to be performed, that is, a howling signal exists in the current audio signal. When the distance between a sound source and an amplification device is too short, energy self-excitation occurs and howling is produced; the howling signal refers to the audio signal corresponding to the howling, and howling is usually sharp and harsh. The current audio signal may be an audio signal that needs howling suppression and is obtained by collecting an audio signal through a collection device such as a microphone and performing signal processing on it; the signal processing may include echo cancellation, noise suppression, howling detection and so on. Echo cancellation refers to eliminating, by means of sound-wave interference, the noise generated because an acoustic feedback path forms through the air between a collection device such as a microphone and a playback device such as a loudspeaker. Noise suppression refers to extracting the pure original audio, that is, an audio signal without background noise, from noisy audio. Howling detection refers to detecting whether a howling signal exists in an audio signal. The current audio signal may also be an audio signal that needs howling suppression and is obtained by receiving an audio signal through a network and processing it; that signal processing may be howling detection. The current time period refers to the time period in which the current audio signal is located, that is, a time period obtained after the audio signal is divided into speech frames; for example, the length of the current time period may be within 10 ms to 30 ms. Frequency-domain transformation refers to transforming the current audio signal from the time domain to the frequency domain. The time domain describes the relationship between the audio signal and time, and the time-domain waveform of the audio signal can express how the audio signal changes over time. The frequency domain is a coordinate system used to describe the frequency characteristics of a signal, and refers to how the audio signal varies with frequency. A frequency-domain graph shows the amount of signal within each given frequency band over a frequency range. The frequency-domain representation may also include information about the phase shift of each sinusoid, so that the frequency components can be recombined to recover the original time-domain signal. The frequency-domain audio signal refers to the audio signal obtained after the current audio signal is transformed from the time domain to the frequency domain.
In a feasible implementation, the terminal may collect speech through a collection device such as a microphone to obtain the audio signal of the current time period, and then perform howling detection on the audio signal. The howling may be detected through a machine learning model built with a neural network, or checked through parameter criteria such as the peak/average ratio. The howling may also be detected based on the pitch period in the audio signal, or based on the energy in the audio signal.
当音频信号中存在啸叫信号时,得到当前时间段对应的当前音频信号。然后通过傅里叶变换将当前音频信号进行频域变换,得到频域音频信号。其中,终端在对采集的音频信号进行啸叫检测之前,还可以对采集的音频信号进行回声消除和噪声抑制等处理。
终端也可以通过网络获取到其他语音通话终端发送的语音,得到当前时间段的音频信号,然后对音频信号进行啸叫检测,当音频信号中存在啸叫信号时,得到当前时间段对应的当前音频信号,再通过傅里叶变换将当前音频信号进行频域变换,得到频域音频信号。在一个实施例中,终端也可以获取到服务器下发的音频信号,然后对音频信号进行啸叫检测,当音频信号中存在啸叫信号时,得到当前时间段对应的当前音频信号。
步骤204,对频域音频信号进行划分,得到各个子带,从各个子带中确定目标子带。
其中,子带是指将频域音频信号进行分割得到的子频带。目标子带是指需要进行啸叫抑制的子带。
可行地,终端对频域音频信号进行划分,可以使用带通滤波器将频域音频信号进行分割,得到各个子带,其中,子带的分割可以按照预先设置好的子带数量进行划分,也可以按照预先设置好的频带范围进行划分等等。然后计算各个子带的能量,根据各个子带的能量选取目标子带。其中,选取的目标子带可以是一个,比如最大能量的子带为目标子带,也可以是多个,比如,选取的目标子带可以是按照子带的能量从大到小依次选取预设数量的子带。
步骤206,获取当前音频信号对应的当前啸叫检测结果和当前语音检测结果,基于当前啸叫检测结果和当前语音检测结果确定当前音频信号对应的子带增益系数。
其中,当前啸叫检测结果是指对当前音频信号进行啸叫检测后得到的检测结果,可以包括当前音频信号中存在啸叫信号和当前音频信号中未存在啸叫信号。当前语音检测结果是指对当前音频信号进行语音端点检测后得到的检测结果,其中语音端点检测(Voice Activity Detection,VAD)是指从当前音频信号中准确的定位出语音的开始和结束。该当前语音检测结果可以包括当前音频信号中存在语音信号和当前音频信号中未存在语音信号。子带增益系数用于表示当前音频信号需要进行啸叫抑制的程度。当子带增益系数越小时,说明需要对当前音频信号进行啸叫抑制的程度越高。当子带增益系数越大时,说明需要对当前音频信号进行啸叫抑制的程度越小。
可行地,终端可以获取到当前音频信号对应的当前啸叫检测结果和当前语音检测结果,该当前音频信号对应的当前啸叫检测结果和当前语音检测结果可以是在对当前音频信号进行啸叫抑制之前进行啸叫检测和语音端点检测,得到当前啸叫检测结果和当前语音检测结果并保存到内存中的。
终端也可以从第三方获取到当前音频信号对应的当前啸叫检测结果和当前语音检测结果,该第三方是对当前音频信号进行啸叫检测和语音端点检测的服务方。比如,终端可以从服务器中获取到保存的当前音频信号对应的当前啸叫检测结果和当前语音检测结果。
步骤208,获取历史时间段的音频信号对应的历史子带增益,基于子带增益系数和历史子带增益计算当前音频信号对应的当前子带增益。
其中,历史时间段是指当前时间段对应的历史时间段,该历史时间段的时间长度可以与当前时间段的时间长度相同,也可以与当前时间段的时间长度不同。该历史时间段可以是当前时间段的前一个时间段,也可以是当前时间段之前的多个时间段。该历史时间段可以与当前时间段存在预设间隔,也可以直接与当前时间段相连。比如,在0ms到100ms的时间内,当前时间段可以是80ms到100ms,历史时间段可以是60ms到80ms的时间段。历史时间段的音频信号是指已经进行啸叫抑制后的音频信号。历史子带增益是指历史时间段的音频信号进行啸叫抑制时使用的子带增益。当前子带增益是指对当前音频信号进行啸叫抑制时使用的子带增益。
可行地,终端可以从内存中获取到历史时间段的音频信号对应的历史子带增益,计算子带增益系数与历史子带增益的乘积,得到当前音频信号对应的当前子带增益。其中,若当前时间段为起始时间段时,历史子带增益为预先设置好的初始子带增益值,比如,该初始子带增益值可以为1,初始子带增益值为1说明不会对当前音频信号进行抑制。当子带增益系数小于一时,说明需要对当前音频信号进行啸叫抑制,当子带增益系数大于一时,说明需要减少对当前音频信号的啸叫抑制。
在一个实施例中,将当前音频信号对应的当前子带增益与预设子带增益的下限值进行比较,当当前音频信号对应的当前子带增益小于预设子带增益的下限值时,将预设子带增益的下限值作为当前音频信号对应的当前子带增益。
在一个实施例中,将当前音频信号对应的当前子带增益与初始子带增益值进行比较,当当前音频信号对应的当前子带增益大于初始子带增益值时,将初始子带增益值作为当前音频信号对应的当前子带增益。
步骤210,基于当前子带增益对目标子带进行啸叫抑制,得到当前时间段对应的第一目标音频信号。
其中,第一目标音频信号是指对当前音频信号中的目标子带进行啸叫抑制后得到的音频信号。
可行地,使用当前子带增益对目标子带的频谱进行增益,然后将增益后的音频信号使用反傅里叶变换算法从频域转换到时域得到当前时间段对应的第一目标音频信号。
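为便于理解上述"对目标子带施加当前子带增益并反变换回时域"的过程,下面给出一个基于Python/numpy的示意性草图(仅为说明思路的假设性实现,并非对本申请方案的限定),其中采样率16kHz、帧长320、子带数8、增益0.3等参数均为举例用的假设取值:

```python
import numpy as np

def suppress_target_subband(frame, subband_edges, target_idx, current_gain):
    """对一帧时域信号 frame 的目标子带乘以当前子带增益后反变换回时域。
    frame: 一帧时域采样(假设已做加窗等预处理)
    subband_edges: 各子带在频谱上的边界索引列表
    target_idx: 目标子带序号
    current_gain: 当前子带增益(0~1 之间)
    """
    spectrum = np.fft.rfft(frame)                      # 时域 -> 频域
    lo, hi = subband_edges[target_idx], subband_edges[target_idx + 1]
    spectrum[lo:hi] *= current_gain                    # 仅对目标子带的频点进行抑制
    return np.fft.irfft(spectrum, n=len(frame))        # 频域 -> 时域,得到啸叫抑制后的一帧

# 用法示例(假设 16kHz 采样率、20ms 帧长、均匀划分 8 个子带)
fs, frame_len = 16000, 320
frame = np.random.randn(frame_len) * 0.01
edges = list(np.linspace(0, frame_len // 2 + 1, 9, dtype=int))
out = suppress_target_subband(frame, edges, target_idx=6, current_gain=0.3)
```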
在一个实施例中,该当前时间段对应的当前音频信号是通过终端的麦克风等采集设备采集到的,则可以将当前时间段对应的第一目标音频信号进行编码,得到编码后的音频信号,然后将编码后的音频信号通过网络接口发送到其他进行语音通话的终端。比如,如图1所示,终端102通过麦克风采集音频信号,进行回声消除和噪声抑制后,得到当前时间段对应的当前音频信号,然后对当前音频信号进行啸叫抑制后,得到当前时间段对应的第一目标音频信号,将当前时间段对应的第一目标音频信号通过服务器106发送到终端104中,终端104接收到当前时间段对应的第一目标音频信号进行解码,然后对解码后的第一目标音频信号进行播放。
在一个实施例中,在得到第一目标音频信号后,还可以调整第一目标音频信号的音量大小,比如,可以增大第一目标音频信号的音量,然后将增大音量的第一目标音频信号进行编码,再将编码后的第一目标音频信号通过网络接口发送到其他进行语音通话的终端。
在一个实施例中,该当前时间段对应的当前音频信号是其他语音通话终端通过网络接口发送的。则可以直接将当前时间段对应的第一目标音频信号进行语音播放。比如,如图1所示,终端102通过麦克风采集音频信号,进行回声消除和噪声抑制后,将音频信号编码并通过服务器106发送到终端104中,终端104接收到编码后的音频信号进行解码,得到解码后的音频信号,对解码后的音频信号进行处理,得到当前时间段对应的当前音频信号,然后对当前音频信号进行啸叫抑制后,得到当前时间段对应的第一目标音频信号,然后将第一目标音频信号进行播放。
在一个可行地实施例中,如图2a所示,为音频信号的频率与能量的关系示意图。其中,该示意图中横坐标表示频率,纵坐标表示能量,基于频率划分得到不同的子带,图中示出了9个子带,频率低于1400HZ的子带为低频子带,高于1400HZ的子带为高频子带,低频子带为第1子带到第4子带,高频带为第5子带到第9子带。该图中实线表示在只有语音信号时频率与能量的关系曲线。虚线表示音频信号中有语音信号和啸叫信号时,频率与能量的关系曲线,可以看到音频信号中有语音信号和啸叫信号时能量明显比只有语音信号时的能量多。此时,在高频子带中,得到第8个子带的能量最多,则确定第8个子带为目标子带。对目标子带进行啸叫抑制。由于对第8个子带进行了啸叫抑制,第8个子带的能量逐渐下降,直到第6个子带的能量为最大子带能量,确定第6个子带为目标子带,然后对第6个子带进行啸叫抑制。
上述啸叫抑制方法,通过获取当前时间段对应的当前音频信号,再获取到当前音频信号对应的当前啸叫检测结果和当前语音检测结果,从而能够根据当前啸叫检测结果和当前语音检测结果确定当前音频信号对应的子带增益系数,并通过子带增益系数和历史子带增益计算当前音频信号对应的当前子带增益,从而使得到的当前子带增益更加的准确,然后使用当前子带增益对目标子带进行啸叫抑制,从而能够准确的对啸叫进行抑制,提高了得到的当前时间段对应的第一目标音频信号的质量,从而能够提高语音通话质量
在一个实施例中,如图3所示,步骤202,获取当前时间段对应的当前音频信号,包括:
步骤302,采集当前时间段对应的初始音频信号,对初始音频信号进行回声消除,得到回声消除后的初始音频信号。
其中,初始音频信号是指通过麦克风等采集设备采集到用户语音后转换得到的数字音频信号。
可行地,当终端是发送语音的发送终端时,终端采集当前时间段对应的初始音频信号,使用回声消除算法对初始音频信号进行回声消除,得到回声消除后的初始音频信号,其中,回声消除可以是通过自适应算法来估计期望信号,该期望信号逼近经过实际回声路径的回声信号,即模拟回声信号,然后从麦克风等采集设备采集的初始音频信号中减去模拟回声,得到回声消除后的初始音频信号。该回声消除算法包括LMS(Least Mean Square,最小均方自适应滤波)算法、RLS(Recursive Least Square,递归最小二乘自适应滤波)算法和APA(Affine Projection Algorithm,仿射投影自适应滤波)算法中的至少一种。
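作为示意,下面给出一个归一化LMS(NLMS)回声消除的Python草图,用于说明"用自适应滤波估计模拟回声并从采集信号中减去"的思路;滤波器阶数、步长等均为假设取值,实际实现也可以采用RLS、APA等其他自适应算法:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=256, mu=0.5, eps=1e-6):
    """mic 为麦克风采集信号, ref 为远端参考(喇叭播放)信号, 二者长度相同。
    返回减去模拟回声后的误差信号, 即回声消除后的初始音频信号(示意)。"""
    w = np.zeros(taps)                       # 自适应滤波器系数, 逼近实际回声路径
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]            # 参考信号最近 taps 个样本
        echo_hat = w @ x                     # 模拟回声
        e = mic[n] - echo_hat                # 从采集信号中减去模拟回声
        w += mu * e * x / (x @ x + eps)      # NLMS 系数更新
        out[n] = e
    return out
```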
步骤304,将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果。
可行地,终端使用语音端点检测算法将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果。其中,语音端点检测算法包括双门限检测法、基于能量的端点检测算法、基于倒谱系数的端点检测算法、基于频带方差的端点检测算法、基于自相关相似距离的端点检测算法、基于信息熵的端点检测算法等等。
步骤306,基于当前语音检测结果对回声消除后的初始音频信号进行噪声抑制,得到噪声抑制后的初始音频信号。
可行地,当前语音检测结果为回声消除后的初始音频信号中未包含语音信号时,对回声消除后的初始音频信号进行噪声估计并对噪声进行抑制,得到噪声抑制后的初始音频信号,其中,可以使用训练好的用于去除噪声的神经网络模型进行噪声抑制,也可以使用滤波器进行噪声抑制。当前语音检测结果为回声消除后的初始音频信号中包含语音信号时,尽量保留语音信号的同时进行噪声抑制,得到噪声抑制后的初始音频信号。语音信号是指用户语音对应的信号。
步骤308,对噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果。
可行地,终端使用啸叫检测算法对噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果,其中,啸叫检测算法可以是基于能量分布的检测算法,比如峰值谐波功率比算法、峰临比算法、帧间峰值保持度算法等等。也可以是基于神经网络的检测算法等等。
步骤310,当当前啸叫检测结果为噪声抑制后的初始音频信号中存在啸叫信号时,将噪声抑制后的初始音频信号作为当前时间段对应的当前音频信号。
可行地,当终端检测到噪声抑制后的初始音频信号中存在啸叫信号时,就将噪声抑制后的初始音频信号作为当前时间段对应的当前音频信号,然后对当前时间段对应的当前音频信号进行啸叫抑制。
在上述实施例中,通过对采集的初始音频信号进行回声消除,并对回声消除后的初始音频信号进行语音端点检测,基于当前语音检测结果进行噪声抑制,对噪声抑制后的初始音频信号进行啸叫检测,当检测到噪声抑制后的初始音频信号中存在啸叫信号,就将噪声抑制后的初始音频信号作为当前时间段对应的当前音频信号,保证得到的当前音频信号是需要进行啸叫抑制的音频信号。
在一个实施例中,步骤304,即将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果,包括:
将回声消除后的初始音频信号输入到语音端点检测模型中进行检测,得到当前语音检测结果,语音端点检测模型是基于训练音频信号和对应的训练语音检测结果使用神经网络算法进行训练得到的。
其中,神经网络算法可以是BP(back propagation,反向传播)神经网络算法、LSTM(Long Short-Term Memory,长短期记忆人工神经网络)算法、RNN(Recurrent Neural Network,循环神经网络)神经网络算法等等。训练音频信号是指训练语音端点检测模型时使用的音频信号,训练语音检测结果是指训练音频信号对应的语音检测结果,该训练语音检测结果包括训练音频信号中包含语音信号和训练音频信号中未包含语音信号,其中,损失函数使用交叉熵损失函数,采用梯度下降法进行优化,激活函数使用S型函数。
可行地,终端使用小波分析提取回声消除后的初始音频信号中的音频特征,该音频特征包括短时过零率,短时能量,短时幅度谱的峰度,短时幅度谱的偏度等等,将音频特征输入到语音端点检测模型中进行检测,得到输出的当前语音检测结果。当前语音检测结果包括回声消除后的初始音频信号中包含语音信号和回声消除后的初始音频信号中未包含语音信号。该语音端点检测模型是基于训练音频信号和对应的训练语音检测结果使用神经网络算法进行训练得到的。可以是在服务器中基于训练音频信号和对应的训练语音检测结果使用神经网络算法进行训练得到并保存,终端从服务器中获取到语音端点检测模型进行使用。也可以在终 端中基于训练音频信号和对应的训练语音检测结果使用神经网络算法进行训练得到。
在一个实施例中,步骤304,即将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果,包括:
将回声消除后的初始音频信号进行低通滤波,得到低频信号;计算低频信号对应的信号能量,基于信号能量计算能量波动,根据能量波动确定当前语音检测结果。
其中,低通滤波是指一种过滤方式,规则为低频信号能正常通过,而超过设定临界值的高频信号则被阻隔、减弱。但是阻隔、减弱的幅度则会依据不同的频率以及不同的滤波程序(目的)而改变。信号能量是指低频信号对应的短时能量。能量波动是指前一帧低频信号与后一帧低频信号之间的信号能量比值。
可行地,由于音频信号中语音信号和啸叫信号的能量分布不同,并且啸叫信号中低频能量明显弱于语音信号。则终端按照预先设置好的低频值将回声消除后的初始音频信号进行低通滤波,得到低频信号,该预先设置好的低频值可以是500HZ。然后计算低频信号中每一帧对应的信号能量,可以使用三角滤波计算信号能量。然后计算前一帧对应的信号能量与后一帧对应的信号能量之间的比值,当比值超过预设的能量比值时,说明回声消除后的初始音频信号中包含语音信号,当比值未超过预设的能量比值时,说明回声消除后的初始音频信号中未包含语音信号,从而得到当前语音检测结果。
在上述实施例中,通过将回声消除后的初始音频信号进行低通滤波,得到低频信号,然后根据低频信号的能量波动确定当前语音检测结果,能够使得到的当前语音检测结果更加的准确。
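下面给出上述"低通滤波+能量波动"语音端点检测思路的一个示意性Python草图(非限定性实现),其中500Hz截止频率、帧长与能量比阈值均为假设取值:

```python
import numpy as np
from scipy.signal import butter, lfilter

def vad_by_low_freq_energy(signal, fs=16000, cutoff=500, frame_len=320, ratio_thr=2.0):
    """将信号低通滤波到 cutoff 以下, 逐帧计算短时能量,
    用相邻帧能量比值(能量波动)是否超过阈值来判断是否包含语音。"""
    b, a = butter(4, cutoff / (fs / 2), btype="low")     # 4 阶巴特沃斯低通滤波
    low = lfilter(b, a, signal)
    n_frames = len(low) // frame_len
    energy = np.array([np.sum(low[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)]) + 1e-12
    up = energy[1:] / energy[:-1]                        # 后一帧与前一帧的能量之比
    down = energy[:-1] / energy[1:]                      # 前一帧与后一帧的能量之比
    fluctuation = np.maximum(up, down)                   # 任一方向的能量波动
    return bool(np.any(fluctuation > ratio_thr))         # 波动超过阈值则认为包含语音
```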
在一个实施例中,步骤304,即将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果,包括步骤:
将回声消除后的初始音频信号进行低通滤波,得到低频信号,对低频信号进行基音检测,得到基音周期,根据基音周期确定当前语音检测结果。
其中,一般的声音都是由发音体发出的一系列频率、振幅各不相同的振动复合而成的。这些振动中有一个频率最低的振动,由它发出的音就是基音,其余为泛音。基音检测是指对基音周期的估计,用于检测到与声带振动频率完全一致或尽可能相吻合的轨迹曲线。基音周期是指声带每开启和闭合一次的时间。
可行地,终端将回声消除后的初始音频信号进行低通滤波,得到低频信号,使用基音检测算法对低频信号进行基音检测,得到基音周期,其中,基音检测算法可以包括自相关法、平均幅度差函数法、并行处理法、倒谱法和简化逆滤波法等等。然后根据基音周期确定回声消除后的初始音频信号是否包含语音信号,即如果能检测到基音周期,说明回声消除后的初始音频信号中包含语音信号,如果未能检测到基音周期,说明回声消除后的初始音频信号中未包含语音信号,从而得到当前语音检测结果。
在上述实施例中,通过检测基音周期来得到当前语音检测结果,提高了得到当前语音检测结果的准确性。
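作为示意,下面给出基音检测中自相关法的一个简化Python草图(非限定性实现),基音频率范围80~400Hz与判决阈值均为假设取值:

```python
import numpy as np

def detect_pitch_autocorr(frame, fs=16000, fmin=80, fmax=400, thr=0.3):
    """对一帧低频信号做自相关基音检测:在基音频率对应的滞后范围内找归一化自相关峰,
    峰值足够显著则返回基音周期(以采样点数表示), 否则返回 None 表示未检测到基音。"""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # 非负滞后的自相关
    if ac[0] <= 0:
        return None
    ac = ac / ac[0]                                      # 归一化
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return lag if ac[lag] > thr else None                # 能检测到基音周期则说明包含语音
```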
在一个实施例中,步骤308,即对噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果,包括步骤:
将噪声抑制后的初始音频信号输入到啸叫检测模型中进行检测,得到当前啸叫检测结果,啸叫检测模型是基于啸叫训练音频信号和对应的训练啸叫检测结果使用神经网络算法进行训练得到的。
其中,神经网络算法可以是BP(back propagation,反向传播)神经网络算法、LSTM(Long Short-Term Memory,长短期记忆人工神经网络)算法、RNN(Recurrent Neural Network,循环神经网络)神经网络算法等等。啸叫训练音频信号是指训练啸叫检测模型时使用的音频信号。训练啸叫检测结果是指啸叫训练音频信号对应的啸叫检测结果,包括噪声抑制后的初始音频信号中包含啸叫信号和噪声抑制后的初始音频信号中未包含啸叫信号。
可行地,终端可以提取噪声抑制后的初始音频信号对应的音频特征,该音频特征包括MFCC(Mel-Frequency cepstrum coefficients,梅尔频率倒谱系数)动态特征、频带表示向量(band representative vectors)以及各种类型的音频指纹,梅尔频率倒谱系数是指组成梅尔频率倒谱的系数。该音频指纹是指通过特定的算法将噪声抑制后的初始音频信号中的数字特征以标识符的形式提取得到的,频带表示向量是一个有序的频带中突出音调的索引列表。终端将提取到的音频特征输入到啸叫检测模型中进行检测,得到当前啸叫检测结果。
在上述实施例中,通过使用啸叫检测模型对噪声抑制后的初始音频信号进行啸叫检测,提高了检测啸叫的效率和准确性。
在一个实施例中,如图4所示,步骤308,即对噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果,包括:
步骤402,提取噪声抑制后的初始音频信号对应的初始音频特征。
其中,初始音频特征是指从噪声抑制后的初始音频信号提取的音频特征,该初始音频特征包括梅尔频率倒谱系数(MFCC,Mel-Frequency cepstrum coefficients)动态特征、频带表示向量(band representative vectors)以及各种类型的音频指纹中的至少一种。
在一个实施例中,终端也可以根据准确性和计算量来选取对应的音频特征,当终端计算资源受限时,可以将频带表示向量以及各种类型的音频指纹作为初始音频特征,当需要较高的准确性时,可以将梅尔频率倒谱系数动态特征、频带表示向量以及各种类型的音频指纹全部作为初始音频特征。
可行地,终端提取噪声抑制后的初始音频信号对应的初始音频特征,比如,提取梅尔频率倒谱系数动态特征可以对噪声抑制后的初始音频信号进行预加重,然后进行分帧,对每一帧进行加窗处理,对加窗处理后的结果进行快速傅里叶变换,得到变换后的结果,对变换后的结果通过三角滤波计算对数能量,然后经离散余弦变换后得到梅尔频率倒谱系数动态特征。
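下面给出提取MFCC动态特征的一个示意性Python草图,这里借助librosa库完成预加重、分帧、加窗、快速傅里叶变换、梅尔滤波与离散余弦变换等步骤;n_fft、hop_length等参数均为假设取值(输入信号需包含足够多的帧,否则差分计算可能失败),并非对本申请特征提取方式的限定:

```python
import numpy as np
import librosa

def mfcc_dynamic_features(y, sr=16000, n_mfcc=13):
    """返回 MFCC 及其一阶差分(动态特征)拼接成的帧级特征矩阵。"""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)   # 库内部完成分帧/加窗/FFT/梅尔滤波/DCT
    delta = librosa.feature.delta(mfcc)                      # 一阶差分, 即动态特征
    return np.vstack([mfcc, delta])
```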
步骤404,获取第一历史时间段对应的第一历史音频信号,并提取第一历史音频信号对应的第一历史音频特征。
其中,第一历史时间段是指当前时间段之前的时间段,并且与当前时间段的时间长度相同,该第一历史时间段可以有多个。比如,当前通话了2500ms,当前时间段的长度为300ms,即当前时间段为2200ms到2500ms,并且,预先设置好的间隔为20ms,则第一历史时间段可以是200ms~500ms,220ms~520ms,240ms~540ms,…,1980ms~2280ms以及2000~2300ms。第一历史音频信号是指第一历史时间段对应的历史音频信号,是在第一历史时间段通过麦克风采集到的音频信号。第一历史音频特征是指第一历史音频信号对应的音频特征,可以包括梅尔频率倒谱系数(MFCC,Mel-Frequency cepstrum coefficients)动态特征、频带表示向量(band representative vectors)以及各种类型的音频指纹中的至少一种。
可行地,终端可以从缓存中获取第一历史时间段对应的第一历史音频信号,也可以从服务器中下载到第一历史时间段对应的第一历史音频信号。然后提取第一历史音频信号对应的第一历史音频特征。
步骤406,计算初始音频特征与第一历史音频特征的第一相似度,基于第一相似度确定当前啸叫检测结果。
其中,第一相似度是指初始音频特征与第一历史音频特征的相似度,该相似度可以是距离相似度,也可以是余弦相似度。
可行地,终端可以使用相似度算法计算初始音频特征与第一历史音频特征的第一相似度,当第一相似度超过预先设置好的第一相似度阈值时,说明噪声抑制后的初始音频信号中存在啸叫信号,当第一相似度未超过预先设置好的第一相似度阈值时,说明噪声抑制后的初始音频信号中未存在啸叫信号,从而得到当前啸叫检测结果。
在一个实施例中,当第一历史时间段有多个时,可以获取到多个第一历史音频信号,分别计算每个第一历史音频信号对应的第一历史音频特征,并分别计算每个第一历史音频特征与初始音频特征之间的第一相似度,统计第一相似度超过预先设置好的第一相似度阈值的持续时长,当该持续时长超过预先设置好的时长时,说明噪声抑制后的初始音频信号中存在啸叫信号,当该持续时长未超过预先设置好的时长时,说明噪声抑制后的初始音频信号中未存在啸叫信号,从而得到当前啸叫检测结果。
在上述实施例中,通过计算初始音频特征与第一历史音频特征的第一相似度,由于啸叫信号语音发送终端和语音接收终端中循环的传输,因此,具有历史相似度,然后基于第一相似度确定当前啸叫检测结果,从而使得到的当前啸叫检测结果更加准确。
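下面给出"计算当前音频特征与多个第一历史音频特征的相似度并统计持续超过阈值的次数"这一判决思路的示意性Python草图(非限定性实现),其中采用余弦相似度,阈值0.9与次数5均为假设取值:

```python
import numpy as np

def cosine_sim(a, b):
    """两个特征向量的余弦相似度。"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def howling_by_history_similarity(cur_feat, history_feats, sim_thr=0.9, min_hits=5):
    """将当前特征与各个第一历史音频特征逐一比较,
    相似度超过阈值的次数达到 min_hits(近似于持续时长超过预设时长)则判定存在啸叫信号。"""
    hits = sum(1 for h in history_feats if cosine_sim(cur_feat, h) > sim_thr)
    return hits >= min_hits
```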
在一个实施例中,啸叫抑制方法,还包括步骤:
当当前啸叫检测结果为当前音频信号中存在啸叫信号时,获取待播放音频信号和预设音频水印信号,将预设音频水印信号添加到待播放音频信号中并进行播放。
其中,待播放音频信号是指终端在用户说话的同时通过播放设备即将播放的音频信号,该音频信号是人耳可察觉(例如播放对方的说话声)或不易察觉(例如对方没有说话时的安静背景声)的信号。预设音频水印信号是指预先设置好的用于表征通过网络发送的音频信号中存在啸叫信号的音频信号,是人耳不易察觉的信号,比如,预设音频水印信号可以是从高频段甚至超声段选取的高频水印信号。
可行地,由于在发送终端检测到啸叫并进行啸叫抑制,当存在多个接收语音信号的接收终端时,会导致所有接收语音信号的接收终端接收到的音频信号为啸叫抑制后的音频信号,影响所有接收终端的音频信号质量。此时发送终端在检测到当前音频信号中存在啸叫信号时,不进行啸叫抑制,获取待播放音频信号和预设音频水印信号,将预设音频水印信号添加到待播放音频信号中并进行播放,然后不对当前音频信号进行啸叫抑制,直接将当前音频信号通过网络发送到所有的接收终端。在一个实施例中,可以在待播放音频信号的高频带中嵌入一个预设频率的单频音或者多频音,作为预设高频水印信号。在一个实施例中,可以在待播放音频信号嵌入多个预设高频水印信号进行播放。在一个实施例中,也可以使用时域音频水印算法将预设音频水印信号添加到待播放音频信号中。在一个实施例中,也可以使用变换域音频水印算法将预设音频水印信号添加到待播放音频信号中。由于产生啸叫的接收终端与发送终端距离较近,此时产生啸叫的接收终端能够接收到添加了预设音频水印信号的音频信号以及当前音频信号。然后产生啸叫的接收终端通过检测添加了预设音频水印信号的音频信号,得到当前音频信号中存在啸叫信号的结果,然后对当前音频信号进行抑制,得到第一目标音频信号并进行播放,避免降低所有接收终端接收到的音频信号的质量。
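作为示意,下面给出"在待播放音频信号的高频段叠加一个人耳不易察觉的单频音作为预设音频水印信号"的Python草图(非限定性实现,实际也可以采用时域或变换域音频水印算法),其中19kHz频点与幅度0.01均为假设取值:

```python
import numpy as np

def embed_high_freq_watermark(playback, fs=48000, tone_hz=19000, amp=0.01):
    """在待播放音频信号中叠加一个低幅度高频单频音作为预设音频水印信号。
    tone_hz 需低于奈奎斯特频率 fs/2, amp 应足够小以保证人耳不易察觉。"""
    t = np.arange(len(playback)) / fs
    watermark = amp * np.sin(2 * np.pi * tone_hz * t)
    return playback + watermark
```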
在一个实施例中,如图5所示,啸叫抑制方法,还包括步骤:
步骤502,采集第一时间段对应的第一音频信号,基于所述第一音频信号进行音频水印检测,确定第一音频信号中包含有目标音频水印信号。
其中,第一音频信号是指通过麦克风等采集设备采集到距离较近的终端通过播放设备发送的音频信号,该终端和距离较近的终端之间可能产生啸叫。音频水印检测可以是使用音频水印检测算法进行检测,音频水印检测算法用于检测第一音频信号中添加的音频水印信号,可以是邻带能量比算法,邻带能量比算法可以是计算第一音频信号中每个子带对应的能量之间的比值,根据比值提取音频水印信号。目标音频水印信号是指距离较近的终端在第一音频信号中添加的预设音频水印信号。第一时间段是指第一音频信号对应的时间段。
可行地,当终端为接收语音的接收终端时,终端通过麦克风等采集设备采集到第一时间段对应的第一音频信号。将第一音频信号进行子带划分,并计算每个子带的能量,然后将相邻子带的能量进行比较,得到邻带能量比值,当邻带能量比值超过预设邻带能量比阈值时,确定第一音频信号中包含有目标高频水印信号,此时说明通过网络接收到的音频信号中含有啸叫信号,预设邻带能量比阈值是指预先设置好的邻带能量比的阈值,用于检测是否含有预先设置好的高频水印信号。在一个实施例中,也可以通过水印提取算法来检测第一音频信号中添加的音频水印信号。
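与之对应,下面给出邻带能量比检测的示意性Python草图(非限定性实现):比较水印频点附近子带与相邻子带的能量之比,比值超过阈值则认为采集到的第一音频信号中包含目标音频水印信号;子带宽度与阈值均为假设取值:

```python
import numpy as np

def detect_watermark_by_band_ratio(frame, fs=48000, tone_hz=19000,
                                   band_hz=500, ratio_thr=6.0):
    """对采集的一帧信号计算水印所在子带与相邻子带的能量比值, 判断是否包含音频水印。"""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def band_energy(lo, hi):
        return float(np.sum(spec[(freqs >= lo) & (freqs < hi)])) + 1e-12

    center = band_energy(tone_hz - band_hz / 2, tone_hz + band_hz / 2)          # 水印所在子带
    neighbor = band_energy(tone_hz - 3 * band_hz / 2, tone_hz - band_hz / 2)    # 相邻子带
    return center / neighbor > ratio_thr
```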
步骤504,接收第二时间段对应的目标网络编码音频信号,将目标网络编码音频信号进行解码,得到目标网络音频信号。
其中,第二时间段是指目标网络编码音频信号对应的时间段。该第二时间段在第一时间段之后。目标网络编码音频信号是指通过网络接收到的编码后的当前音频信号。目标网络音频信号是指对目标网络编码音频信号进行解码后得到的当前音频信号。
可行地,终端通过网络接收到第二时间段对应的目标网络编码音频信号,将目标网络编码音频信号进行解码,得到目标网络音频信号。
步骤506,基于第一音频信号中包含目标音频水印信号将目标网络音频信号作为当前音频信号。
可行地,终端根据第一音频信号中包含目标音频水印信号将目标网络音频信号作为当前音频信号。
在上述实施例中,当终端为接收语音的终端时,可以通过采集的第一音频信号检测预设音频水印信号,当检测到第一音频信号中存在预设音频水印信号时,将通过网络接收到的目标网络音频信号作为当前音频信号,然后对当前音频信号进行啸叫抑制,避免影响所有终端接收到的音频信号质量,并且通过检测预设音频水印信号来确定是否将目标网络音频信号作为当前音频信号,提高了得到的当前音频信号的准确性。
在一个实施例中,如图6所示,步骤202,获取当前时间段对应的当前音频信号,包括:
步骤602,接收当前时间段对应的当前网络编码音频信号,将网络编码音频信号进行解码,得到当前网络音频信号。
其中,当前时间段是指终端通过网络接收到的当前网络编码音频信号的时间段。当前网络编码音频信号是指通过网络接收到的编码后的音频信号。
可行地,当终端为接收语音的终端时,终端通过网络接口接收到当前时间段对应的当前网络编码音频信号,将网络编码音频信号进行解码,得到当前网络音频信号。
步骤604,将当前网络音频信号进行语音端点检测,得到网络语音检测结果,同时对当前网络音频信号进行啸叫检测,得到网络啸叫检测结果。
其中,网络语音检测结果是指对当前网络音频信号进行语音端点检测得到的结果,包括当前网络音频信号中包含语音信号和当前网络音频信号中未包含语音信号。网络啸叫检测结果是指对当前网络音频信号进行啸叫检测得到的结果,可以包括当前网络音频信号包含啸叫信号和当前网络音频信号未包含啸叫信号。
在一个实施例中,通过语音端点检测模型对当前网络音频信号进行语音端点检测,得到网络语音检测结果,通过啸叫检测模型对当前网络音频信号进行啸叫检测,得到网络啸叫检测结果。
在一个实施例中,可以将当前网络音频信号进行低通滤波,得到低频信号,计算低频信号对应的信号能量,基于信号能量计算能量波动,根据所述能量波动确定当前网络音频信号对应的网络语音检测结果。
在一个实施例中,可以将当前网络音频信号进行低通滤波,得到低频信号,对低频信号进行基音检测,得到基音周期,根据基音周期确定当前网络音频信号对应的网络语音检测结果。
在一个实施例中,可以提取当前网络音频信号对应的当前网络音频特征,并获取到历史网络音频特征,计算历史网络音频特征与当前网络音频特征的相似度,基于相似度确定网络啸叫检测结果。
步骤606,提取当前网络音频信号的网络音频特征,并获取第二历史时间段的第二历史音频信号,提取第二历史音频信号对应的第二历史音频特征。
其中,网络音频特征是指当前网络音频信号对应的音频特征。第二历史时间段是指第二历史音频信号对应的时间段,可以有多个第二历史时间段。第二历史音频信号是指通过麦克风等采集设备采集得到的历史音频信号。第二历史音频特征是指第二历史音频信号对应的音频特征。
可行地,终端提取到当前网络音频信号的网络音频特征,并获取到内存中保存的第二历史时间段的第二历史音频信号,提取到第二历史音频信号对应的第二历史音频特征。
步骤608,计算网络音频特征与第二历史音频特征的网络音频相似度,基于网络音频相似度和网络啸叫检测结果确定网络音频信号为当前时间段对应的当前音频信号。
其中,网络音频相似度是指当前网络音频信号与第二历史音频信号的相似程度,网络音频相似度越高说明终端与发送当前网络音频信号的终端之间距离越近。
可行地,终端通过相似度算法计算网络音频特征与第二历史音频特征的网络音频相似度,当网络音频相似度超过预设网络音频相似度阈值并且网络啸叫检测结果为当前网络音频信号中存在啸叫信号时,将网络音频信号作为当前时间段对应的当前音频信号。其中,预设网络音频相似度阈值是用于确定终端与发送当前网络音频信号的终端位置的阈值,当网络音频相似度超过预设网络音频相似度阈值时,说明终端与发送当前网络音频信号的终端位置相近,容易产生啸叫。当网络音频相似度未超过预设网络音频相似度阈值时,说明终端与发送当前网络音频信号的终端位置较远,不易产生啸叫。
在一个实施例中,终端可以获取到多个第二历史音频信号,提取到每个第二历史音频信号对应的第二历史音频特征,分别计算每个第二历史音频特征与网络音频特征的网络音频相似度,当网络音频相似度超过预设网络音频相似度阈值的持续时长超过预设阈值时,说明终端与发送当前网络音频信号的终端位置相近,当网络音频相似度超过预设网络音频相似度阈值的持续时长未超过预设阈值时,说明终端与发送当前网络音频信号的终端位置较远,其中多个是指至少两个。
在上述实施例中,通过计算网络音频特征与第二历史音频特征的网络音频相似度,基于网络音频相似度和网络啸叫检测结果确定网络音频信号为当前时间段对应的当前音频信号,从而使确定的当前音频信号更加的准确。
在一个实施例中,步骤204,对频域音频信号进行划分,得到各个子带,从各个子带中确定目标子带,包括:
按照预设子带个数将频域音频信号进行划分,得到各个子带。计算各个子带对应的子带能量,并对各个子带能量进行平滑,得到平滑后的各个子带能量。基于平滑后的各个子带能量确定目标子带。
其中,预设子带个数是预先设置好的要进行划分的子带个数。
可行地,终端按照预设子带个数将频域音频信号进行不均匀的划分,得到各个子带。终端然后计算各个子带对应的子带能量,该子带能量可以是音量,也可以是对数能量。即在一个实施例中,可以使用三角滤波器来计算各个子带对应的子带能量。例如可以通过30个三角滤波器计算每个子带的能量。每个子带的频率范围可以不相等,相邻子带之间可能存在频率上的交叠。然后对每个子带能量进行平滑处理,即获取到最近时间段中存在的相同位置的子带对应的能量,然后计算平均值,得到该子带平滑后的子带能量。比如,要对当前音频信号中第一个子带的子带能量进行平滑,可以获取到最近10次的历史音频信号中第一个子带的历史子带能量,然后计算平均子带能量,将平均子带能量作为当前音频信号中第一个子带平滑后的子带能量。依次计算得到每个子带对应的平滑后的子带能量。
然后将平滑后的各个子带能量进行比较,选取子带能量最大的子带作为目标子带,该目标子带包含最多的啸叫能量。在一个实施例中,可以从指定子带开始选取最大子带能量的子带。比如,该当前音频信号划分为30个子带,可以从第6个到第30个子带中选取具有最大的平滑后的子带能量对应的子带。在一个实施例中,可以根据比较结果从大到小依次选取预设数量的子带作为目标子带。比如,选取子带能量由大到小排序前三的子带作为目标子带。
在上述实施例中,通过将各个子带能量进行平滑后根据平滑后的子带能量从各个子带中选取目标子带,从而使选取的目标子带更加的准确。
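下面给出"计算子带能量、在时间上进行平滑、从指定子带起选取能量最大的子带作为目标子带"这一过程的示意性Python草图;为简化起见这里采用均匀子带划分代替三角滤波器与不均匀划分,子带数30、平滑帧数10、起始子带序号等均为假设取值,并非对本申请实现方式的限定:

```python
import numpy as np
from collections import deque

class TargetSubbandSelector:
    """维护最近若干帧的子带能量, 对其取平均作为平滑后的子带能量, 再选取目标子带。"""

    def __init__(self, num_subbands=30, history=10, start_band=5):
        self.num_subbands = num_subbands
        self.start_band = start_band                 # 从第 6 个子带(索引 5)起选取
        self.history = deque(maxlen=history)         # 最近 history 帧的子带能量
        self.edges = None                            # 子带边界(首帧时确定)

    def select(self, frame):
        spec = np.abs(np.fft.rfft(frame)) ** 2
        if self.edges is None:
            self.edges = np.linspace(0, len(spec), self.num_subbands + 1, dtype=int)
        energy = np.array([spec[self.edges[i]:self.edges[i + 1]].sum()
                           for i in range(self.num_subbands)])
        self.history.append(energy)
        smoothed = np.mean(self.history, axis=0)     # 时间平滑后的各个子带能量
        return self.start_band + int(np.argmax(smoothed[self.start_band:]))
```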
在一个实施例中,基于平滑后的各个子带能量确定目标子带,包括:
获取当前音频信号对应的当前啸叫检测结果,根据当前啸叫检测结果从各个子带中确定各个啸叫子带,并得到各个啸叫子带能量;从各个啸叫子带能量中选取目标能量,并将目标能量对应的目标啸叫子带作为目标子带。
其中,啸叫子带是指包含有啸叫信号的子带。啸叫子带能量是指啸叫子带对应的能量。目标能量是指最大的啸叫子带能量。目标啸叫子带是指最大的啸叫子带能量对应的啸叫子带。
可行地,终端获取到当前音频信号对应的当前啸叫检测结果,当当前啸叫检测结果为当前音频信号中存在啸叫信号时,根据啸叫信号的频率与语音信号的频率从各个子带中确定啸叫信号对应的子带,从而得到各个啸叫子带。然后根据各个子带的能量确定各个啸叫子带对应的能量。然后比较各个啸叫子带能量,选取最大的啸叫子带能量作为目标能量,将目标能量对应的目标啸叫子带作为目标子带。
在一个实施例中,可以将各个啸叫子带能量对应的各个啸叫子带直接作为目标子带,即计算各个啸叫子带对应的子带增益系数,并获取到各个啸叫子带对应的历史子带增益,计算子带增益系数与历史子带增益的乘积,得到各个啸叫子带对应的当前子带增益,基于各个当前子带增益对各个啸叫子带进行啸叫抑制,得到第一目标音频信号。
在上述实施例中,通过当前啸叫检测结果从各个子带中确定各个啸叫子带,然后从各个啸叫子带中确定目标子带,提高了得到目标子带的准确性。
在一个实施例中,如图7所示,步骤206,基于当前啸叫检测结果和当前语音检测结果确定当前音频信号对应的子带增益系数,包括:
步骤702,当当前语音检测结果为当前音频信号中未包含语音信号且当前啸叫检测结果为当前音频信号中包含啸叫信号时,获取预设递减系数,将预设递减系数作为当前音频信号对应的子带增益系数;
其中,预设递减系数是指预先设置好的使子带增益递减的系数。可以为小于1的值。
可行地,终端检测到当前音频信号中未包含语音信号且当前音频信号中存在啸叫信号时,获取到预设递减系数,将预设递减系数作为当前音频信号对应的子带增益系数。即在检测到当前音频信号中未包含语音信号且当前音频信号中存在啸叫信号时,需要将子带增益从初始值逐渐递减,直到当前音频信号中未存在啸叫信号或者当前音频信号的子带增益达到了预先的下限值。比如,0.08。
步骤704,当当前语音检测结果为当前音频信号中包含语音信号且当前啸叫检测结果为当前音频信号中包含啸叫信号时,获取预设第一递增系数,将预设第一递增系数作为当前音频信号对应的子带增益系数;
步骤706,当当前啸叫检测结果为当前音频信号中未包含啸叫信号时,获取预设第二递增系数,将预设第二递增系数作为当前音频信号对应的子带增益系数,其中,预设第一递增系数大于预设第二递增系数。
其中,预设第一递增系数是指预先设置好的在当前音频信号中包含语音信号且包含啸叫信号时使子带增益增加的系数。预设第二递增系数是指预先设置好的在当前音频信号中未包含啸叫信号时使子带增益增加的系数。预设第一递增系数大于预设第二递增系数。
可行地,终端检测到当前音频信号中包含语音信号且包含啸叫信号时,将预设第一递增系数作为当前音频信号对应的子带增益系数。此时为了保护语音信号的质量,需要迅速递增子带增益,从而使子带增益恢复至初始值。终端检测到当前音频信号中未包含啸叫信号时,将预设第二递增系数作为当前音频信号对应的子带增益系数,此时按照预设第二递增系数将当前音频信号的子带增益恢复至初始值。其中,预设第一递增系数大于预设第二递增系数,说明当前音频信号中包含语音信号且包含啸叫信号时子带增益恢复到初始值的速度要大于当前音频信号中未包含啸叫信号时的恢复速度。例如,在一次语音通话中,每间隔20ms,获取到当前音频信号,计算当前音频信号的子带增益。语音通话起始时,一般未存在啸叫信号,则子带增益会保持不变。然后当检测到存在啸叫信号且未包含语音信号时,按照预设递减系数将当前音频信号子带增益的初始值进行递减,然后当检测到存在啸叫信号且包含语音信号时,按照预设第一递增系数计算当前音频信号的子带增益,即迅速递增当前音频信号的子带增益,使子带增益回复到初始值。
在上述实施例中,根据当前语音检测结果和啸叫检测结果确定子带增益系数,从而能够使得到的子带增益系数更加的准确,从而使啸叫抑制更加准确,进一步提高了得到的第一目标音频信号的质量。
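结合前文"当前子带增益=历史子带增益×子带增益系数,并在预设下限值与初始子带增益值之间截断"的描述,下面给出该判决与更新逻辑的示意性Python草图;递减系数0.9、两个递增系数1.5/1.1以及下限值0.08均为举例用的假设取值:

```python
def subband_gain_coef(howl_flag, vad_flag, dec=0.9, inc_fast=1.5, inc_slow=1.1):
    """根据当前啸叫检测结果与当前语音检测结果选取子带增益系数:
    有啸叫且无语音时递减, 有啸叫且有语音时快速递增, 无啸叫时缓慢递增。"""
    if howl_flag and not vad_flag:
        return dec            # 预设递减系数
    if howl_flag and vad_flag:
        return inc_fast       # 预设第一递增系数
    return inc_slow           # 预设第二递增系数


def update_subband_gain(prev_gain, coef, floor=0.08, init=1.0):
    """当前子带增益 = 历史子带增益 × 子带增益系数, 并限制在 [下限值, 初始子带增益值] 内。"""
    gain = prev_gain * coef
    return max(floor, min(gain, init))
```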
在一个实施例中,如图8所示,啸叫抑制方法还包括:
步骤802,基于预设低频范围从当前音频信号中确定目标低频信号和目标高频信号。
其中,预设低频范围是指预先设置好的人声的频率范围,比如,小于1400HZ。目标低频信号是指当前音频信号中在预设低频范围内的音频信号,目标高频信号是指当前音频信号中超过预设低频范围的音频信号。
可行地,终端按照预设低频范围将当前音频信号进行划分,得到目标低频信号和目标高频信号。比如,将当前音频信号中小于1400HZ的音频信号作为目标低频信号,将当前音频信号中超过1400HZ的音频信号作为目标高频信号。
步骤804,计算目标低频信号对应的低频能量,将低频能量进行平滑,得到平滑后的低频能量。
其中,低频能量是指目标低频信号对应的能量。
可行地,终端可以直接计算目标低频信号对应的低频能量,也可以将目标低频信号进行划分,得到各个低频信号的子带,然后计算各个低频信号的子带对应的能量,再计算各个低频信号的子带对应的能量之和,得到目标低频信号对应的低频能量。然后将低频能量进行平滑处理,得到平滑后的低频能量。其中,可以使用如下公式(1)进行平滑处理。
E_v(t)=a*E_v(t-1)+(1-a)*E_c  公式(1)
其中,E_v(t)是指当前时间段对应的当前音频信号中目标低频信号对应的平滑后的低频能量。E_v(t-1)是指前一历史时间段对应的历史音频信号中历史低频信号对应的历史低频能量。E_c是指当前时间段对应的当前音频信号中目标低频信号对应的低频能量。a是指平滑系数,是预先设置好的。其中,E_c大于E_v(t-1)时a的取值可以和E_c小于E_v(t-1)时a的取值不同,用于更好的追踪能量的上升段和下降段。
步骤806,将目标高频信号进行划分,得到各个高频子带,并计算各个高频子带对应的高频子带能量。
可行地,终端可以将目标高频信号进行划分,得到各个高频子带,并使用三角滤波器计算各个高频子带对应的高频子带能量。
步骤808,获取各个高频子带对应的预设能量上限权重,基于各个高频子带对应的预设能量上限权重与平滑后的低频能量计算各个高频子带对应的高频子带上限能量。
其中,预设能量上限权重是指预先设置好的高频子带的能量上限权重,不同的高频子带有不同的预设能量上限权重,高频子带可以按照频率由低到高的顺序设置能量上限权重依次降低。高频子带上限能量是指高频子带能量的上限,高频子带的能量不能超过该上限。
可行地,终端获取各个高频子带对应的预设能量上限权重,并计算高频子带对应的预设能量上限权重与平滑后的低频能量的乘积,得到各个高频子带对应的高频子带上限能量。可以使用公式(2)计算高频子带上限能量。
E_u(k)=E_v(t)*b(k)  公式(2)
其中,k是指第k个高频子带,k为正整数,E_u(k)为第k个高频子带对应的高频子带上限能量。E_v(t)是指目标低频信号对应的平滑后的低频能量,b(k)是指第k个高频子带对应的预设能量上限权重,比如,各个高频子带的预设能量上限权重可以依次为(0.8,0.7,0.6,…)。
步骤810,计算高频子带上限能量与高频子带能量的比值,得到各个高频子带上限增益。
其中,高频子带上限增益是指对高频子带进行子带增益时对应的上限增益,即对高频子带进行子带增益时不能超过高频子带上限增益。
可行地,终端分别计算每个高频子带上限能量与对应的高频子带能量的比值,得到各个高频子带上限增益。比如,可以使用公式(3)计算高频子带上限增益。
M(k)=E_u(k)/E(k)  公式(3)
其中,E(k)是指第k个高频子带对应的高频子带能量。E_u(k)是指第k个高频子带对应的高频子带上限能量。M(k)是指第k个高频子带对应的高频子带上限增益。
步骤812,计算各个高频子带对应的各个高频子带增益,基于各个高频子带上限增益和各个高频子带增益确定各个高频子带目标增益,基于各个高频子带目标增益对各个高频子带进行啸叫抑制,得到当前时间段对应的第二目标音频信号。
其中,高频子带增益是根据高频子带增益系数和历史高频子带增益计算得到的。高频子带增益系数是根据当前啸叫检测结果和当前语音检测结果确定的。历史高频子带增益是指历史时间段的历史音频信号对应的高频子带的增益。高频子带目标增益是指进行啸叫抑制时使用的增益。第二目标音频信号是指将所有高频子带都进行啸叫抑制后得到的音频信号。
可行地,终端获取到各个历史高频子带对应的各个历史高频子带增益,并根据当前啸叫检测结果和当前语音检测结果确定各个高频子带增益系数,分别计算各个历史高频子带增益与各个高频子带增益系数的乘积得到各个高频子带对应的各个高频子带增益。分别比较各个高频子带上限增益与对应的各个高频子带增益,选取高频子带上限增益与高频子带增益中的较小增益作为高频子带目标增益。比如,可以使用公式(4)来选取高频子带目标增益。
B(k)=min[G(k),M(k)]  公式(4)
其中,B(k)是指第k个高频子带对应的高频子带目标增益,G(k)是指第k个高频子带对应的高频子带增益,M(k)是指第k个高频子带对应的高频子带上限增益。然后终端使用各个高频子带目标增益对各个高频子带进行啸叫抑制,将啸叫抑制后的各个高频子带对应的频域音频信号转换为时域音频信号,得到当前时间段对应的第二目标音频信号。
在一个具体的实施例中,如图8a所示,为能量约束的曲线示意图,该曲线示意图中横坐标表示频率,纵坐标表示能量,基于频率划分得到不同的子带,图中示出了9个子带,频率低于1400HZ的子带为低频带,高于1400HZ的子带为高频带,低频带为第1子带到第4子带,高频带为第5子带到第9子带。其中,曲线C为音频信号中只有语音信号时的能量曲线。曲线B是指对高频信号的能量约束曲线。曲线A是指为音频信号中包含语音信号和啸叫信号时的能量曲线。明显可以看出,在低频带即第1子带到第4子带有语音信号时,不进行能量约束。在高频带,即在第4个子带之后,包含有啸叫信号时,需要将音频信号的能量约束到曲线B以下,得到啸叫抑制后的音频信号。
在上述实施例中,通过使用高频子带上限增益来对高频子带的高频子带能量进行约束,保证了得到的第二目标音频信号的质量。
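下面给出公式(1)~(4)所述能量约束思路的示意性Python草图(非限定性实现),其中平滑系数a与各高频子带的预设能量上限权重均为假设取值:

```python
import numpy as np

def constrain_high_band_gains(low_energy, prev_smoothed_low, high_energies,
                              high_gains, upper_weights, a=0.9):
    """按公式(1)~(4)的思路计算各高频子带目标增益:
    low_energy: 当前帧目标低频信号的低频能量 E_c
    prev_smoothed_low: 前一时间段的平滑低频能量 E_v(t-1)
    high_energies: 各高频子带能量 E(k)
    high_gains: 按子带增益系数与历史增益得到的各高频子带增益 G(k)
    upper_weights: 各高频子带的预设能量上限权重 b(k)"""
    ev = a * prev_smoothed_low + (1 - a) * low_energy       # 公式(1): 平滑后的低频能量 E_v(t)
    eu = ev * np.asarray(upper_weights)                     # 公式(2): 高频子带上限能量 E_u(k)
    m = eu / (np.asarray(high_energies) + 1e-12)            # 公式(3): 高频子带上限增益 M(k)
    b = np.minimum(np.asarray(high_gains), m)               # 公式(4): 目标增益 B(k)=min[G(k),M(k)]
    return b, ev
```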
在一个具体的实施例中,如图9所示,啸叫抑制方法,包括以下步骤:
步骤902,通过麦克风采集当前时间段对应的初始音频信号,对初始音频信号进行回声消除,得到回声消除后的初始音频信号。
步骤904,将回声消除后的初始音频信号输入到语音端点检测模型中进行检测,得到当前语音检测结果。基于当前语音检测结果对回声消除后的初始音频信号进行噪声抑制,得到噪声抑制后的初始音频信号。
步骤906,提取噪声抑制后的初始音频信号对应的初始音频特征,获取第一历史时间段对应的第一历史音频信号,并提取第一历史音频信号对应的第一历史音频特征,计算初始音频特征与第一历史音频特征的第一相似度,基于第一相似度确定当前啸叫检测结果。
步骤908,当当前啸叫检测结果为噪声抑制后的初始音频信号中存在啸叫信号时,将当前音频信号进行频域变换,得到频域音频信号;
步骤910,按照预设子带个数将频域音频信号进行划分,得到各个子带,计算各个子带对应的子带能量,并对各个子带能量进行平滑,得到平滑后的各个子带能量,基于平滑后的各个子带能量确定目标子带。
步骤912,当当前语音检测结果为当前音频信号中未包含语音信号且当前啸叫检测结果为当前音频信号中包含啸叫信号时,获取预设递减系数,将预设递减系数作为当前音频信号对应的子带增益系数。
步骤914,获取历史时间段的音频信号对应的历史子带增益,基于子带增益系数和历史子带增益计算当前音频信号对应的当前子带增益。
步骤916,基于当前子带增益对目标子带进行啸叫抑制,得到当前时间段对应的第一目标音频信号,将所述当前时间段对应的第一目标音频信号通过网络发送到接收第一目标音频信号的终端。
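为串联上述各步骤,下面给出发送端单帧处理流程的示意性Python骨架;其中调用的函数与类(vad_by_low_freq_energy、howling_by_history_similarity、subband_gain_coef、update_subband_gain、TargetSubbandSelector、suppress_target_subband)均为前文示意草图中的假设实现,特征、缓存与历史数据的管理在实际系统中会更复杂,此处仅用于说明数据流向:

```python
import numpy as np

def howling_suppress_frame(frame, prev_gain, history_feats, selector):
    """对一帧当前音频信号执行"检测—确定系数—更新增益—抑制目标子带"的示意流程。
    history_feats: 第一历史音频特征列表(与当前帧幅度谱同维), selector: TargetSubbandSelector 实例。"""
    vad_flag = vad_by_low_freq_energy(frame)                          # 语音端点检测
    feat = np.abs(np.fft.rfft(frame))                                 # 简化的音频特征(幅度谱)
    howl_flag = howling_by_history_similarity(feat, history_feats)    # 啸叫检测
    coef = subband_gain_coef(howl_flag, vad_flag)                     # 子带增益系数
    gain = update_subband_gain(prev_gain, coef)                       # 当前子带增益
    if not howl_flag:
        return frame, gain                                            # 无啸叫时此处简化为直接透传
    target = selector.select(frame)                                   # 确定目标子带
    out = suppress_target_subband(frame, list(selector.edges), target, gain)
    return out, gain
```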
本申请还提供一种应用场景,该应用场景应用上述的啸叫抑制方法。可行地,该啸叫抑制方法在该应用场景的应用如下:
在企业微信应用中进行语音会议时,如图10所示,为啸叫抑制方法的具体场景应用图,其中,终端1002和终端1004在同一个房间内,并且与其他终端进行voip(Voice over Internet Protocol,基于IP的语音传输)通话。此时,终端1002的麦克风采集的语音会通过网络发送到终端1004,并经过终端1004的喇叭播放之后,终端1002的麦克风又会再次采集到该语音,因此,形成一个声学回路,如此循环反复,产生“啸叫”的声学效果。
此时进行啸叫抑制时,提供一种啸叫抑制方法的架构示意图,如图11所示,其中,终端都是通过上行音频处理对麦克风采集的音频信号进行处理后通过网络进行编码发送。通过下行音频处理对从网络接口获取到的音频信号进行处理后进行音频播放。
可行地,终端1002通过麦克风采集到的声音会经过上行音频处理后编码发送到网络侧形成网络信号。上行音频处理包括对音频信号进行回声消除,并对回声消除后的音频信号进行语音端点检测即语音分析识别非语音信号和语音信号。对非语音信号进行噪声抑制,得到噪声抑制后的音频信号。然后对噪声抑制后的音频信号进行啸叫检测,得到啸叫检测结果。根据啸叫检测结果和语音端点检测结果进行啸叫抑制,得到啸叫抑制后的语音信号,将啸叫抑制后的语音信号进行音量控制,然后编码发送。
其中,在进行啸叫抑制时,如图12所示,为进行啸叫抑制的流程图。终端1002将需要 进行啸叫抑制的音频信号进行信号分析即将时域变换到频域,得到频域变换后的音频信号,然后将频域变换后的音频信号按照预设子带个数以及子带频率范围计算各个子带的能量。然后对各个子带的能量在时间上进行平滑处理,得到平滑后的各个子带能量。从平滑后的各个子带能量中选取最大的平滑后的子带能量作为目标子带。基于啸叫检测结果和语音检测结果确定音频信号对应的子带增益系数,具体来说,howlFlag表示啸叫检测结果,当howlFlag为1时,说明音频信号中存在啸叫信号,当howlFlag为0时,说明音频信号中未存在啸叫信号。当VAD为1时,说明音频信号中包括语音信号。当VAD为0时,说明音频信号中未包含语音信号。当在howlFlag为1且VAD为0时,获取预设递减系数作为子带增益系数,当在howlFlag为1且VAD为1时,获取预设第一递增系数作为子带增益系数,当在howlFlag为0时,获取预设第二递增系数作为子带增益系数。同时,获取到上一个音频信号在进行啸叫处理时使用的历史子带增益,计算历史子带增益与子带增益系数的乘积,得到当前子带增益,使用当前子带增益对目标子带进行啸叫抑制,得到啸叫抑制后的音频信号,然后将啸叫抑制后的音频信号从网络侧进行发送。
同时,在进行啸叫抑制时,还可以基于预设低频范围从当前音频信号中确定目标低频信号和目标高频信号;计算目标低频信号对应的低频能量,将低频能量进行平滑,得到平滑后的低频能量;将目标高频信号进行划分,得到各个高频子带,并计算各个高频子带对应的高频子带能量;获取各个高频子带对应的预设能量上限权重,基于各个高频子带对应的预设能量上限权重与平滑后的低频能量计算各个高频子带对应的高频子带上限能量;计算高频子带上限能量与高频子带能量的比值,得到各个高频子带上限增益;计算各个高频子带对应的各个高频子带增益,基于各个高频子带上限增益和各个高频子带增益确定各个高频子带目标增益,基于各个高频子带目标增益对各个高频子带进行啸叫抑制,得到当前时间段对应的第二目标音频信号,将第二目标音频信号通过网络侧进行发送。终端1004通过网络接口接收到网络信号时,进行解码得到音频信号,然后进行下行音频处理后进行音频播放,该下行音频处理可以是进行音量控制等等。同理,终端1004中的上行音频处理也可以使用相同的方法对音频进行处理后通过网络侧进行发送。
在一个具体的实施例中,提供另一种啸叫抑制方法的架构示意图,如图13所示,具体来说:
如图10所示,终端1002在发送音频信号到各个终端时,由于终端1002与终端1004较近,可能导致啸叫。而其他终端,包括终端1008、终端1010和终端1012,与终端1002较远,不会产生啸叫,此时,可以在接收音频信号的终端中进行啸叫抑制。具体来说:
在终端1004中,通过网络接口接收到终端1002发送的网络信号时,进行解码,得到音频信号,该音频信号一般是在发送终端经过回声消除和噪声抑制的信号。此时终端1004直接对音频信号进行啸叫检测和语音端点检测,得到啸叫检测结果和语音端点检测结果。并且,终端1004通过麦克风采集同样时间长度的历史音频信号,进行本地检测,该本地检测是用于检测终端1004和终端1002是否相近。具体来说:通过提取通过麦克风采集同样时间长度的音频信号的音频特征,以及提取通过网络侧接收的音频信号的音频特征,然后计算相似度。当该相似度持续一段时间均超过预先设置的相似度阈值时,即说明终端1004和终端1002相近,得到本地检测结果为终端1004和终端1002相近,说明终端1004是造成啸叫的音频回路上的终端。此时根据本地检测结果、啸叫检测结果和语音端点检测结果进行啸叫抑制,即执行如图12的流程对啸叫进行抑制,得到啸叫抑制后的音频信号,然后终端1004将啸叫抑制 后的音频信号进行播放。在一个实施例中,当啸叫检测结果为音频信号中存在啸叫信号的可能性超过预设本地检测暂停阈值时,则暂停本地检测的运行,只根据啸叫检测结果和语音端点检测结果进行啸叫抑制,节省终端资源。
通过在接收音频信号的终端中进行啸叫抑制,保证了其他接收音频的终端接收的音频信号质量。并且通过本地检测结果、啸叫检测结果和语音端点检测结果进行啸叫抑制,提高了啸叫抑制的准确性。同理,终端1004的下行音频的处理方法,即上述对音频信息进行啸叫处理的流程也可以应用到其他终端中的下行音频处理中,比如,终端1002中。
在一个具体的实施例中,如图14所示,还提供另一种啸叫抑制方法的架构示意图,具体来说:
终端1002通过麦克风采集到当前音频信号,将当前音频信号进行回声消除以及噪声抑制后,进行啸叫检测,得到当前啸叫检测结果。当当前啸叫检测结果为当前音频信号中存在啸叫信号时,获取待播放音频信号和预设音频水印信号,将预设音频水印信号添加到待播放音频信号中并通过喇叭进行播放,同时将当前音频信号经过音量控制并编码成网络信号,通过网络接口发送到终端1004。
此时终端1004通过麦克风采集终端1002通过喇叭播放的音频信号,然后进行水印检测,即计算采集的音频信号的邻带能量比值,当邻带能量比值超过预设邻带能量比阈值时,确定采集的音频信号中包含有设置好的音频水印信号。此时,终端1004获取到终端1002发送的网络信号,进行解码后得到音频信号,将该音频信号进行啸叫抑制,即执行如图12所示的流程,得到啸叫抑制后的音频信号,将啸叫抑制后的音频信号通过喇叭播放。通过发送终端在播放的音频信号中添加音频水印信号,由于产生啸叫的终端距离较近,则接收终端会通过麦克风采集到添加音频水印信号的音频信号,对采集的音频信号进行水印检测后进行啸叫抑制,提高了啸叫抑制的效率和准确性。同理终端1004通过网络侧发送音频信号时也可以添加音频水印信号,则终端1002也可以进行水印检测来确定是否对接收的音频信号进行啸叫抑制。
应该理解的是,虽然图2、图3-8以及图9的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2、图3-8以及图9中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图15所示,提供了一种啸叫抑制装置1500,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:信号变换模块1502、子带确定模块1504、系数确定模块1506、增益确定模块1508和啸叫抑制模块1510,其中:
信号变换模块1502,用于获取当前时间段对应的当前音频信号,将当前音频信号进行频域变换,得到频域音频信号;
子带确定模块1504,用于对频域音频信号进行划分,得到各个子带,从各个子带中确定目标子带;
系数确定模块1506,用于获取当前音频信号对应的当前啸叫检测结果和当前语音检测结果,基于当前啸叫检测结果和当前语音检测结果确定当前音频信号对应的子带增益系数;
增益确定模块1508,用于获取历史时间段的音频信号对应的历史子带增益,基于子带增益系数和历史子带增益计算当前音频信号对应的当前子带增益;
啸叫抑制模块1510,用于基于当前子带增益对目标子带进行啸叫抑制,得到当前时间段对应的第一目标音频信号。
在一个实施例中,信号变换模块1502,包括:
回声消除单元,用于采集当前时间段对应的初始音频信号,对初始音频信号进行回声消除,得到回声消除后的初始音频信号;
语音检测单元,用于将回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果;
噪声抑制单元,用于基于当前语音检测结果对回声消除后的初始音频信号进行噪声抑制,得到噪声抑制后的初始音频信号;
啸叫检测单元,用于对噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果;
当前音频信号确定单元,用于当当前啸叫检测结果为噪声抑制后的初始音频信号中存在啸叫信号时,将噪声抑制后的初始音频信号作为当前时间段对应的当前音频信号。
在一个实施例中,语音检测单元还用于将回声消除后的初始音频信号输入到语音端点检测模型中进行检测,得到当前语音检测结果,语音端点检测模型是基于训练音频信号和对应的训练语音检测结果使用神经网络算法进行训练得到的。
在一个实施例中,语音检测单元还用于将回声消除后的初始音频信号进行低通滤波,得到低频信号;计算低频信号对应的信号能量,基于信号能量计算能量波动,根据能量波动确定当前语音检测结果。
在一个实施例中,语音检测单元还用于将回声消除后的初始音频信号进行低通滤波,得到低频信号;对低频信号进行基音检测,得到基音周期,根据基音周期确定当前语音检测结果。
在一个实施例中,啸叫检测单元还用于将噪声抑制后的初始音频信号输入到啸叫检测模型中进行检测,得到当前啸叫检测结果,啸叫检测模型是基于啸叫训练音频信号和对应的训练啸叫检测结果使用神经网络算法进行训练得到的。
在一个实施例中,啸叫检测单元还用于提取噪声抑制后的初始音频信号对应的初始音频特征;获取第一历史时间段对应的第一历史音频信号,并提取第一历史音频信号对应的第一历史音频特征;计算初始音频特征与第一历史音频特征的第一相似度,基于第一相似度确定当前啸叫检测结果。
在一个实施例中,啸叫抑制装置,还包括:
水印添加模块,用于当当前啸叫检测结果为当前音频信号中存在啸叫信号时,获取待播放音频信号和预设音频水印信号;将预设音频水印信号添加到待播放音频信号中并进行播放。
在一个实施例中,啸叫抑制装置,还包括:
水印检测模块,用于采集第一时间段对应的第一音频信号,基于所述第一音频信号进行音频水印检测,确定第一音频信号中包含目标音频水印信号;
信号得到模块,用于接收第二时间段对应的目标网络编码音频信号,将目标网络编码音频信号进行解码,得到目标网络音频信号;
当前音频信号确定模块,用于基于第一音频信号中包含目标音频水印信号将目标网络音频信号作为当前音频信号。
在一个实施例中,信号变换模块1502,包括:
网络信号得到模块,用于接收当前时间段对应的当前网络编码音频信号,将网络编码音频信号进行解码,得到当前网络音频信号;
网络信号检测模块,用于将当前网络音频信号进行语音端点检测,得到网络语音检测结果,并对当前网络音频信号进行啸叫检测,得到网络啸叫检测结果;
特征提取模块,用于提取当前网络音频信号的网络音频特征,并获取第二历史时间段的第二历史音频信号,提取第二历史音频信号对应的第二历史音频特征;
当前音频信号得到模块,用于计算网络音频特征与第二历史音频特征的网络音频相似度,基于网络音频相似度和网络啸叫检测结果确定网络音频信号为当前时间段对应的当前音频信号。
在一个实施例中,子带确定模块1504还用于按照预设子带个数将频域音频信号进行划分,得到各个子带;计算各个子带对应的子带能量,并对各个子带能量进行平滑,得到平滑后的各个子带能量;基于平滑后的各个子带能量确定目标子带。
在一个实施例中,子带确定模块1504还用于获取当前音频信号对应的当前啸叫检测结果,根据当前啸叫检测结果从各个子带中确定各个啸叫子带,并得到各个啸叫子带能量;从各个啸叫子带能量中选取目标能量,并将目标能量对应的目标啸叫子带作为目标子带。
在一个实施例中,系数确定模块1506还用于当当前语音检测结果为当前音频信号中未包含语音信号且当前啸叫检测结果为当前音频信号中包含啸叫信号时,获取预设递减系数,将预设递减系数作为当前音频信号对应的子带增益系数;当当前语音检测结果为当前音频信号中包含语音信号且当前啸叫检测结果为当前音频信号中包含啸叫信号时,获取预设第一递增系数,将预设第一递增系数作为当前音频信号对应的子带增益系数;当当前啸叫检测结果为当前音频信号中未包含啸叫信号时,获取预设第二递增系数,将预设第二递增系数作为当前音频信号对应的子带增益系数,其中,预设第一递增系数大于预设第二递增系数。
在一个实施例中,啸叫抑制装置还包括:
信号划分模块,用于基于预设低频范围从当前音频信号中确定目标低频信号和目标高频信号;
低频能量计算模块,用于计算目标低频信号对应的低频能量,将低频能量进行平滑,得到平滑后的低频能量;
高频能量计算模块,用于将目标高频信号进行划分,得到各个高频子带,并计算各个高频子带对应的高频子带能量;
上限能量计算模块,用于获取各个高频子带对应的预设能量上限权重,基于各个高频子带对应的预设能量上限权重与平滑后的低频能量计算各个高频子带对应的高频子带上限能量;
上限增益确定模块,用于计算高频子带上限能量与高频子带能量的比值,得到各个高频子带上限增益;
目标音频信号得到模块,用于计算各个高频子带对应的各个高频子带增益,基于各个高频子带上限增益和各个高频子带增益确定各个高频子带目标增益,基于各个高频子带目标增益对各个高频子带进行啸叫抑制,得到当前时间段对应的第二目标音频信号。
关于啸叫抑制装置的具体限定可以参见上文中对于啸叫抑制方法的限定,在此不再赘述。上述啸叫抑制装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图16所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、运营商网络、NFC(近场通信)或其他技术实现。该计算机可读指令被处理器执行时以实现一种啸叫抑制方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图16中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,该处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种非易失性的计算机可读存储介质,存储有计算机可读指令,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述任一实施例中啸叫抑制方法的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种啸叫抑制方法,由计算机设备执行,所述方法包括:
    获取当前时间段对应的当前音频信号,将所述当前音频信号进行频域变换,得到频域音频信号;
    对所述频域音频信号进行划分,得到各个子带,从所述各个子带中确定目标子带;
    获取所述当前音频信号对应的当前啸叫检测结果和当前语音检测结果,基于所述当前啸叫检测结果和当前语音检测结果确定所述当前音频信号对应的子带增益系数;
    获取历史时间段的音频信号对应的历史子带增益,基于所述子带增益系数和所述历史子带增益计算所述当前音频信号对应的当前子带增益;及
    基于所述当前子带增益对所述目标子带进行啸叫抑制,得到所述当前时间段对应的第一目标音频信号。
  2. 根据权利要求1所述的方法,其特征在于,所述获取当前时间段对应的当前音频信号,包括:
    采集所述当前时间段对应的初始音频信号,对所述初始音频信号进行回声消除,得到回声消除后的初始音频信号;
    将所述回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果;
    基于所述当前语音检测结果对所述回声消除后的初始音频信号进行噪声抑制,得到噪声抑制后的初始音频信号;
    对所述噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果;及
    当所述当前啸叫检测结果为所述噪声抑制后的初始音频信号中存在啸叫信号时,将所述噪声抑制后的初始音频信号作为所述当前时间段对应的当前音频信号。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果,包括:
    将所述回声消除后的初始音频信号进行低通滤波,得到低频信号;及
    计算所述低频信号对应的信号能量,基于所述信号能量计算能量波动,根据所述能量波动确定所述当前语音检测结果。
  4. 根据权利要求2所述的方法,其特征在于,所述将所述回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果,包括:
    将所述回声消除后的初始音频信号进行低通滤波,得到低频信号;及
    对所述低频信号进行基音检测,得到基音周期,根据所述基音周期确定所述当前语音检测结果。
  5. 根据权利要求2所述的方法,其特征在于,所述对所述噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果,包括:
    提取所述噪声抑制后的初始音频信号对应的初始音频特征;
    获取第一历史时间段对应的第一历史音频信号,并提取所述第一历史音频信号对应的第一历史音频特征;及
    计算所述初始音频特征与所述第一历史音频特征的第一相似度,基于所述第一相似度确定所述当前啸叫检测结果。
  6. 根据权利要求1所述的方法,其特征在于,所述方法,还包括:
    当所述当前啸叫检测结果为所述当前音频信号中存在啸叫信号时,获取待播放音频信号 和预设音频水印信号;及
    将所述预设音频水印信号添加到所述待播放音频信号中并进行播放。
  7. 根据权利要求6所述的方法,其特征在于,所述方法,还包括:
    采集第一时间段对应的第一音频信号,基于所述第一音频信号进行音频水印检测,确定所述第一音频信号中包含目标音频水印信号;
    接收第二时间段对应的目标网络编码音频信号,将所述目标网络编码音频信号进行解码,得到目标网络音频信号;及
    基于所述第一音频信号中包含目标音频水印信号将所述目标网络音频信号作为所述当前音频信号。
  8. 根据权利要求1所述的方法,其特征在于,所述获取当前时间段对应的当前音频信号,包括:
    接收所述当前时间段对应的当前网络编码音频信号,将所述网络编码音频信号进行解码,得到当前网络音频信号;
    将所述当前网络音频信号进行语音端点检测,得到网络语音检测结果,并对所述当前网络音频信号进行啸叫检测,得到网络啸叫检测结果;
    提取所述当前网络音频信号的网络音频特征,并获取第二历史时间段的第二历史音频信号,提取所述第二历史音频信号对应的第二历史音频特征;及
    计算所述网络音频特征与所述第二历史音频特征的网络音频相似度,基于所述网络音频相似度和所述网络啸叫检测结果确定所述网络音频信号为所述当前时间段对应的当前音频信号。
  9. 根据权利要求1所述的方法,其特征在于,所述对所述频域音频信号进行划分,得到各个子带,从所述各个子带中确定目标子带,包括:
    按照预设子带个数将所述频域音频信号进行划分,得到各个子带;
    计算所述各个子带对应的子带能量,并对各个子带能量进行平滑,得到所述平滑后的各个子带能量;及
    基于所述平滑后的各个子带能量确定目标子带。
  10. 根据权利要求9所述的方法,其特征在于,所述基于所述平滑后的各个子带能量确定目标子带,包括:
    获取所述当前音频信号对应的当前啸叫检测结果,根据所述当前啸叫检测结果从所述各个子带中确定各个啸叫子带,并得到所述各个啸叫子带能量;及
    从所述各个啸叫子带能量中选取目标能量,并将所述目标能量对应的目标啸叫子带作为目标子带。
  11. 根据权利要求1所述的方法,其特征在于,所述基于所述当前啸叫检测结果和所述当前语音检测结果确定所述当前音频信号对应的子带增益系数,包括:
    当所述当前语音检测结果为所述当前音频信号中未包含语音信号且所述当前啸叫检测结果为所述当前音频信号中包含啸叫信号时,获取预设递减系数,将所述预设递减系数作为所述当前音频信号对应的子带增益系数;
    当所述当前语音检测结果为所述当前音频信号中包含语音信号且所述当前啸叫检测结果为所述当前音频信号中包含啸叫信号时,获取预设第一递增系数,将所述预设第一递增系数作为所述当前音频信号对应的子带增益系数;及
    当所述当前啸叫检测结果为所述当前音频信号中未包含啸叫信号时,获取预设第二递增系数,将所述预设第二递增系数作为所述当前音频信号对应的子带增益系数,其中,所述预设第一递增系数大于所述预设第二递增系数。
  12. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    基于预设低频范围从所述当前音频信号中确定目标低频信号和目标高频信号;
    计算所述目标低频信号对应的低频能量,将所述低频能量进行平滑,得到平滑后的低频能量;
    将所述目标高频信号进行划分,得到各个高频子带,并计算所述各个高频子带对应的高频子带能量;
    获取所述各个高频子带对应的预设能量上限权重,基于所述各个高频子带对应的预设能量上限权重与所述平滑后的低频能量计算所述各个高频子带对应的高频子带上限能量;
    计算所述高频子带上限能量与所述高频子带能量的比值,得到各个高频子带上限增益;及
    计算各个高频子带对应的各个高频子带增益,基于所述各个高频子带上限增益和所述各个高频子带增益确定所述各个高频子带目标增益,基于所述各个高频子带目标增益对所述各个高频子带进行啸叫抑制,得到所述当前时间段对应的第二目标音频信号。
  13. 一种啸叫抑制装置,其特征在于,所述装置包括:
    信号变换模块,用于获取当前时间段对应的当前音频信号,将所述当前音频信号进行频域变换,得到频域音频信号;
    子带确定模块,用于对所述频域音频信号进行划分,得到各个子带,从所述各个子带中确定目标子带;
    系数确定模块,用于获取所述当前音频信号对应的当前啸叫检测结果和当前语音检测结果,基于所述当前啸叫检测结果和所述当前语音检测结果确定所述当前音频信号对应的子带增益系数;
    增益确定模块,用于获取历史时间段的音频信号对应的历史子带增益,基于所述子带增益系数和所述历史子带增益计算所述当前音频信号对应的当前子带增益;及
    啸叫抑制模块,用于基于所述当前子带增益对所述目标子带进行啸叫抑制,得到所述当前时间段对应的第一目标音频信号。
  14. 根据权利要求13所述的装置,其特征在于,所述信号变换模块,包括:
    回声消除单元,用于采集所述当前时间段对应的初始音频信号,对所述初始音频信号进行回声消除,得到回声消除后的初始音频信号;
    语音检测单元,用于将所述回声消除后的初始音频信号进行语音端点检测,得到当前语音检测结果;
    噪声抑制单元,用于基于所述当前语音检测结果对所述回声消除后的初始音频信号进行噪声抑制,得到噪声抑制后的初始音频信号;
    啸叫检测单元,用于对所述噪声抑制后的初始音频信号进行啸叫检测,得到当前啸叫检测结果;及
    当前音频信号确定单元,用于当所述当前啸叫检测结果为所述噪声抑制后的初始音频信号中存在啸叫信号时,将所述噪声抑制后的初始音频信号作为所述当前时间段对应的当前音频信号。
  15. 根据权利要求13所述的装置,其特征在于,所述装置,还包括:
    水印添加模块,用于当所述当前啸叫检测结果为所述当前音频信号中存在啸叫信号时,获取待播放音频信号和预设音频水印信号;及将所述预设音频水印信号添加到所述待播放音频信号中并进行播放。
  16. 根据权利要求15所述的装置,其特征在于,所述装置,还包括:
    水印检测模块,用于采集第一时间段对应的第一音频信号,基于所述第一音频信号进行音频水印检测,确定所述第一音频信号中包含目标音频水印信号;
    信号得到模块,用于接收第二时间段对应的目标网络编码音频信号,将所述目标网络编码音频信号进行解码,得到目标网络音频信号;及
    当前音频信号确定模块,用于基于所述第一音频信号中包含目标音频水印信号将所述目标网络音频信号作为所述当前音频信号。
  17. 根据权利要求13所述的装置,其特征在于,所述信号变换模块,包括:
    网络信号得到模块,用于接收所述当前时间段对应的当前网络编码音频信号,将所述网络编码音频信号进行解码,得到当前网络音频信号;
    网络信号检测模块,用于将所述当前网络音频信号进行语音端点检测,得到网络语音检测结果,并对所述当前网络音频信号进行啸叫检测,得到网络啸叫检测结果;
    特征提取模块,用于提取所述当前网络音频信号的网络音频特征,并获取第二历史时间段的第二历史音频信号,提取所述第二历史音频信号对应的第二历史音频特征;及
    当前音频信号得到模块,用于计算所述网络音频特征与所述第二历史音频特征的网络音频相似度,基于所述网络音频相似度和所述网络啸叫检测结果确定所述网络音频信号为所述当前时间段对应的当前音频信号。
  18. 根据权利要求13所述的装置,其特征在于,所述装置还包括:
    信号划分模块,用于基于预设低频范围从所述当前音频信号中确定目标低频信号和目标高频信号;
    低频能量计算模块,用于计算所述目标低频信号对应的低频能量,将所述低频能量进行平滑,得到平滑后的低频能量;
    高频能量计算模块,用于将所述目标高频信号进行划分,得到各个高频子带,并计算所述各个高频子带对应的高频子带能量;
    上限能量计算模块,用于获取所述各个高频子带对应的预设能量上限权重,基于所述各个高频子带对应的预设能量上限权重与所述平滑后的低频能量计算所述各个高频子带对应的高频子带上限能量;
    上限增益确定模块,用于计算所述高频子带上限能量与所述高频子带能量的比值,得到各个高频子带上限增益;及
    目标音频信号得到模块,用于计算各个高频子带对应的各个高频子带增益,基于所述各个高频子带上限增益和所述各个高频子带增益确定所述各个高频子带目标增益,基于所述各个高频子带目标增益对所述各个高频子带进行啸叫抑制,得到所述当前时间段对应的第二目标音频信号。
  19. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,其特征在于,所述计算机可读指令被所述处理器执行时,使得所述处理器执行时实现权利要求1至12中任一项所述的方法的步骤。
  20. 一个或多个存储有计算机可读指令的非易失性存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述处理器执行时实现权利要求1至12中任一项所述的方法的步骤。
PCT/CN2021/112769 2020-09-30 2021-08-16 啸叫抑制方法、装置、计算机设备和存储介质 WO2022068440A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21874102.3A EP4131254A4 (en) 2020-09-30 2021-08-16 WHISTLE SUPPRESSION METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIUM
US17/977,380 US20230046518A1 (en) 2020-09-30 2022-10-31 Howling suppression method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011062254.8 2020-09-30
CN202011062254.8A CN114333749A (zh) 2020-09-30 2020-09-30 啸叫抑制方法、装置、计算机设备和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/977,380 Continuation US20230046518A1 (en) 2020-09-30 2022-10-31 Howling suppression method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022068440A1 true WO2022068440A1 (zh) 2022-04-07

Family

ID=80951084

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112769 WO2022068440A1 (zh) 2020-09-30 2021-08-16 啸叫抑制方法、装置、计算机设备和存储介质

Country Status (4)

Country Link
US (1) US20230046518A1 (zh)
EP (1) EP4131254A4 (zh)
CN (1) CN114333749A (zh)
WO (1) WO2022068440A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117440093A (zh) * 2023-10-31 2024-01-23 中移互联网有限公司 在线会议声音自激消除方法、装置、设备及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024206654A1 (en) * 2023-03-30 2024-10-03 Qualcomm Incorporated Machine learning-based feedback cancellation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1684143A (zh) * 2004-04-14 2005-10-19 华为技术有限公司 一种语音增强的方法
CN109461455A (zh) * 2018-11-30 2019-03-12 维沃移动通信(深圳)有限公司 一种消除啸叫的系统及方法
CN109637552A (zh) * 2018-11-29 2019-04-16 河北远东通信系统工程有限公司 一种抑制音频设备啸叫的语音处理方法
CN110012408A (zh) * 2019-04-19 2019-07-12 宁波启拓电子设备有限公司 啸叫检测方法及装置
CN110213694A (zh) * 2019-04-16 2019-09-06 浙江大华技术股份有限公司 一种音频设备及其啸叫的处理方法、计算机存储介质
CN111724808A (zh) * 2019-03-18 2020-09-29 Oppo广东移动通信有限公司 音频信号处理方法、装置、终端及存储介质

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0678046A (ja) * 1992-08-25 1994-03-18 Fujitsu Ltd ハンズフリー・システムで用いられる音声スイッチ
JPH10260692A (ja) * 1997-03-18 1998-09-29 Toshiba Corp 音声の認識合成符号化/復号化方法及び音声符号化/復号化システム
US7451093B2 (en) * 2004-04-29 2008-11-11 Srs Labs, Inc. Systems and methods of remotely enabling sound enhancement techniques
US7742608B2 (en) * 2005-03-31 2010-06-22 Polycom, Inc. Feedback elimination method and apparatus
GB2448201A (en) * 2007-04-04 2008-10-08 Zarlink Semiconductor Inc Cancelling non-linear echo during full duplex communication in a hands free communication system.
WO2014094242A1 (en) * 2012-12-18 2014-06-26 Motorola Solutions, Inc. Method and apparatus for mitigating feedback in a digital radio receiver
CN104934039B (zh) * 2014-03-21 2018-10-23 鸿富锦精密工业(深圳)有限公司 音频信号的水印信息加载装置及方法
KR102263700B1 (ko) * 2015-08-06 2021-06-10 삼성전자주식회사 단말기 및 단말기의 동작 방법
JP6446145B2 (ja) * 2015-09-28 2018-12-26 旭化成エレクトロニクス株式会社 ハウリング抑制装置
CN111048119B (zh) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 通话音频混音处理方法、装置、存储介质和计算机设备
JP2023519249A (ja) * 2020-03-23 2023-05-10 ドルビー ラボラトリーズ ライセンシング コーポレイション エコー残留抑制
US11250833B1 (en) * 2020-09-16 2022-02-15 Apple Inc. Method and system for detecting and mitigating audio howl in headsets

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1684143A (zh) * 2004-04-14 2005-10-19 华为技术有限公司 一种语音增强的方法
CN109637552A (zh) * 2018-11-29 2019-04-16 河北远东通信系统工程有限公司 一种抑制音频设备啸叫的语音处理方法
CN109461455A (zh) * 2018-11-30 2019-03-12 维沃移动通信(深圳)有限公司 一种消除啸叫的系统及方法
CN111724808A (zh) * 2019-03-18 2020-09-29 Oppo广东移动通信有限公司 音频信号处理方法、装置、终端及存储介质
CN110213694A (zh) * 2019-04-16 2019-09-06 浙江大华技术股份有限公司 一种音频设备及其啸叫的处理方法、计算机存储介质
CN110012408A (zh) * 2019-04-19 2019-07-12 宁波启拓电子设备有限公司 啸叫检测方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117440093A (zh) * 2023-10-31 2024-01-23 中移互联网有限公司 在线会议声音自激消除方法、装置、设备及存储介质

Also Published As

Publication number Publication date
EP4131254A1 (en) 2023-02-08
US20230046518A1 (en) 2023-02-16
CN114333749A (zh) 2022-04-12
EP4131254A4 (en) 2023-10-11

Similar Documents

Publication Publication Date Title
US10504539B2 (en) Voice activity detection systems and methods
JP5666444B2 (ja) 特徴抽出を使用してスピーチ強調のためにオーディオ信号を処理する装置及び方法
CN107945815B (zh) 语音信号降噪方法及设备
JP5528538B2 (ja) 雑音抑圧装置
US8831936B2 (en) Systems, methods, apparatus, and computer program products for speech signal processing using spectral contrast enhancement
JP6169849B2 (ja) 音響処理装置
Kim et al. Nonlinear enhancement of onset for robust speech recognition.
US20130282369A1 (en) Systems and methods for audio signal processing
JP2004502977A (ja) サブバンド指数平滑雑音消去システム
CN112004177B (zh) 一种啸叫检测方法、麦克风音量调节方法及存储介质
WO2012158156A1 (en) Noise supression method and apparatus using multiple feature modeling for speech/noise likelihood
WO2022068440A1 (zh) 啸叫抑制方法、装置、计算机设备和存储介质
US11380312B1 (en) Residual echo suppression for keyword detection
US8423357B2 (en) System and method for biometric acoustic noise reduction
CN111292758B (zh) 语音活动检测方法及装置、可读存储介质
EP1913591B1 (en) Enhancement of speech intelligibility in a mobile communication device by controlling the operation of a vibrator in dependance of the background noise
WO2019072395A1 (en) APPARATUS AND METHOD FOR IMPROVING SIGNALS
JP2007293059A (ja) 信号処理装置およびその方法
US11386911B1 (en) Dereverberation and noise reduction
US11462231B1 (en) Spectral smoothing method for noise reduction
US11259117B1 (en) Dereverberation and noise reduction
US20130226568A1 (en) Audio signals by estimations and use of human voice attributes
Yang et al. Environment-Aware Reconfigurable Noise Suppression
Krishnamoorthy et al. Processing noisy speech for enhancement
Kamaraju et al. Speech Enhancement Technique Using Eigen Values

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874102

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021874102

Country of ref document: EP

Effective date: 20221028

NENP Non-entry into the national phase

Ref country code: DE