CN112233689B - Audio noise reduction method, device, equipment and medium - Google Patents

Audio noise reduction method, device, equipment and medium

Info

Publication number
CN112233689B
Authority
CN
China
Prior art keywords
noise
frequency energy
noise spectrum
audio frame
low
Prior art date
Legal status
Active
Application number
CN202011019647.0A
Other languages
Chinese (zh)
Other versions
CN112233689A (en)
Inventor
郝斌
冯大航
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202011019647.0A
Publication of CN112233689A
Application granted
Publication of CN112233689B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides an audio noise reduction method, apparatus, device, and medium, and belongs to the technical field of audio processing. The technical solution provided by the embodiments of the disclosure takes into account how the energy of impact noise is distributed between low and high frequencies, so the extracted first noise spectrum can accurately distinguish human voice from noise. In addition, the signal attenuation component is estimated through a reverberation transfer function. Because noise estimation does not rely on audio information from multiple frames, the delay is small: impact noise can be effectively suppressed with a small number of delay frames, the signal attenuation component is effectively removed from the noise-reduced target audio frame, and the noise reduction effect is better.

Description

Audio noise reduction method, device, equipment and medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio denoising method, apparatus, device, and medium.
Background
In recent years, with the continuous development of audio processing technology, intelligent voice interaction systems such as smart speakers and vehicle-mounted voice interaction systems have become increasingly widespread. Such a system receives audio containing the user's voice and processes it so that the voice is recognized and man-machine interaction is realized. In practical use, the audio received by an intelligent voice interaction system is often mixed with different kinds of noise, so the audio must first be subjected to noise reduction.
In the related art, audio noise reduction generally exploits the rapid attenuation of impact noise: based on the audio information of the current frame and several future frames, the minimum smoothed power spectrum among those frames is selected as the smoothed power spectrum used for minimum tracking.
With this method, when the number of delay frames is set small, much of the speech is estimated as impact noise, which damages the speech, and many attenuation segments still remain after processing, so the noise reduction effect is poor.
Disclosure of Invention
The embodiment of the disclosure provides an audio noise reduction method, an audio noise reduction device, audio noise reduction equipment and an audio noise reduction medium, and improves the noise reduction effect. The technical scheme is as follows:
in one aspect, an audio noise reduction method is provided, and the method includes:
extracting a first noise spectrum of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
processing the first noise spectrum based on a reverberation transfer function to obtain a second noise spectrum, wherein the second noise spectrum comprises signal attenuation components in the audio frame to be subjected to noise reduction;
and removing the noise in the audio frame to be subjected to noise reduction according to the second noise spectrum to obtain a target audio frame.
In one possible implementation, the processing the first noise spectrum based on the reverberation transfer function to obtain a second noise spectrum includes:
acquiring a product of the first noise spectrum and the reverberation transfer function, and taking the product as a third noise spectrum, wherein the third noise spectrum is a signal attenuation component in the audio frame to be denoised;
acquiring the sum of the first noise spectrum and the third noise spectrum, and taking the sum of the first noise spectrum and the third noise spectrum as the second noise spectrum.
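Read per frequency bin, the two operations above amount to second = first + first × H, where H is the reverberation transfer function. The following is a minimal sketch only: it assumes magnitude spectra stored as plain lists of floats and a real-valued per-bin transfer function `h`. The source fixes neither representation, and the function name `second_noise_spectrum` is hypothetical.

```python
from typing import List


def second_noise_spectrum(first: List[float], h: List[float]) -> List[float]:
    """Combine the first noise spectrum with its estimated attenuation component.

    `first` and `h` are assumed to be per-bin values of equal length
    (an assumption of this sketch, not stated in the source).
    """
    if len(first) != len(h):
        raise ValueError("spectrum and transfer function must have the same length")
    # Third noise spectrum: product of the first noise spectrum and the
    # reverberation transfer function (the signal attenuation component).
    third = [n * g for n, g in zip(first, h)]
    # Second noise spectrum: sum of the first and third noise spectra.
    return [n + t for n, t in zip(first, third)]
```

With `h` set to zero in every bin, the second noise spectrum reduces to the first, which is a convenient sanity check.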
In one possible implementation, the obtaining of the reverberation transfer function includes:
acquiring environment information, the position of the receiver that receives the audio frame to be denoised, the sound source position, and the sound velocity;
and determining the reverberation transfer function according to the environment information, the receiver position for receiving the audio frame to be denoised, the sound source position and the sound velocity.
In one possible implementation manner, the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised includes:
determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
and extracting a first noise spectrum of the audio frame to be denoised according to the noise existence probability.
In one possible implementation manner, the determining, according to a ratio of low-frequency energy to high-frequency energy in an audio frame to be denoised, a noise existence probability of the audio frame to be denoised includes:
determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;
and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the fact that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.
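The threshold rule above can be sketched as a two-way selection. This is an illustration under assumptions: the function name and the idea of passing the two probabilities as arguments are hypothetical, and the source gives no numeric values for the threshold or for either probability.

```python
def noise_presence_probability(
    ratio: float,
    ratio_threshold: float,
    p_first: float,
    p_second: float,
) -> float:
    """Pick the frame-level noise existence probability from the low/high ratio.

    `p_first` must be smaller than `p_second`, mirroring the constraint in the text.
    """
    if not p_first < p_second:
        raise ValueError("the first probability must be smaller than the second")
    # A large low/high ratio means energy is concentrated in low frequencies,
    # so the frame gets the smaller noise existence probability.
    if ratio > ratio_threshold:
        return p_first
    # Otherwise (ratio <= threshold) the frame gets the larger probability.
    return p_second
```

Note that a ratio exactly equal to the threshold falls into the second branch, as the text specifies "smaller than or equal to".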
In one possible implementation manner, the determining of the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised includes:
acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;
and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised.
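As a rough illustration of the ratio of the low-frequency energy mean to the high-frequency energy mean, the sketch below splits a magnitude spectrum at a single bin index. The split point `split_bin`, the use of squared magnitude as energy, and the small `eps` guard against division by zero are all assumptions of this sketch; the source does not say where the low band ends or how energy is measured.

```python
from typing import List


def low_high_energy_ratio(
    magnitudes: List[float], split_bin: int, eps: float = 1e-12
) -> float:
    """Return mean low-band energy divided by mean high-band energy.

    Bins [0, split_bin) are treated as low frequency and the rest as high
    frequency (a hypothetical band split chosen for illustration).
    """
    if not 0 < split_bin < len(magnitudes):
        raise ValueError("split_bin must leave at least one bin in each band")
    low = magnitudes[:split_bin]
    high = magnitudes[split_bin:]
    # Energy of a bin is taken as its squared magnitude.
    low_mean = sum(m * m for m in low) / len(low)
    high_mean = sum(m * m for m in high) / len(high)
    return low_mean / (high_mean + eps)
```

For example, a frame with magnitudes `[2, 2, 1, 1]` split at bin 2 has a low-band energy mean of 4 and a high-band mean of 1, giving a ratio of about 4.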
In one possible implementation manner, the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised further includes:
and adjusting the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the adjusting the first noise spectrum according to a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
determining the voice existence probability of the first noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the first noise spectrum;
and extracting noise in the first noise spectrum according to the voice existence probability to obtain an adjusted first noise spectrum.
In one possible implementation, the determining the speech existence probability of the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum is larger than a target threshold, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.
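One way to picture the band-wise assignment above is as a per-bin list of speech existence probabilities. The sketch below is hypothetical: the source only describes the case where the ratio exceeds the target threshold, so the fallback value `p_default` used otherwise, the single split index, and the function name are assumptions added for illustration.

```python
from typing import List


def speech_presence_by_band(
    ratio: float,
    target_threshold: float,
    num_bins: int,
    split_bin: int,
    p_first: float,
    p_second: float,
    p_default: float,
) -> List[float]:
    """Assign a speech existence probability to every bin of the first noise spectrum."""
    if not p_first < p_second:
        raise ValueError("the first probability must be smaller than the second")
    if ratio > target_threshold:
        # High/low ratio above the threshold: the high band receives the
        # smaller probability, the low band the larger one.
        return [p_second] * split_bin + [p_first] * (num_bins - split_bin)
    # Case not covered by the source; this sketch falls back to a caller-chosen value.
    return [p_default] * num_bins
```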
In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the first noise spectrum;
and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
determining whether noise exists in each of a plurality of frequency points of the first noise spectrum;
and determining the proportion of high-frequency energy and low-frequency energy in the first noise spectrum according to the noise existence result of each frequency point.
In a possible implementation manner, the determining, according to the noise existence result of each frequency point, a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
and according to the weight of each frequency point, carrying out weighted summation on the noise existence result of each frequency point to obtain the proportion of high-frequency energy and low-frequency energy in the first noise spectrum.
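The weighted summation can be sketched as follows, treating each frequency point's noise existence result as a boolean flag. The weights are hypothetical inputs; the source does not state how they are chosen, and the function name is an assumption of this sketch.

```python
from typing import List


def weighted_noise_score(flags: List[bool], weights: List[float]) -> float:
    """Weighted sum of per-bin noise existence results.

    A flag of True contributes its bin's weight; False contributes nothing.
    """
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    return sum(w for f, w in zip(flags, weights) if f)
```

If the weights sum to 1, the result lies between 0 and 1 and can be compared directly against a threshold.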
In one possible implementation manner, the determining whether noise exists in each of the frequency points of the first noise spectrum includes:
acquiring the magnitude relation between the amplitude of each frequency point in the multiple frequency points of the first noise spectrum and an amplitude threshold;
for any frequency point, responding to the condition that the amplitude of the frequency point is greater than or equal to an amplitude threshold value, and determining that the frequency point comprises noise;
and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than an amplitude threshold value, and determining that the frequency point does not contain noise.
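The per-frequency-point decision above is a plain threshold comparison. A minimal sketch, assuming amplitudes are given as a list of floats and a single threshold applies to all bins (the source does not say whether the threshold varies by frequency):

```python
from typing import List


def bins_with_noise(amplitudes: List[float], amplitude_threshold: float) -> List[bool]:
    """Flag each frequency point as containing noise or not.

    A bin is flagged when its amplitude is greater than or equal to the
    threshold, and left unflagged when it is smaller.
    """
    return [a >= amplitude_threshold for a in amplitudes]
```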
In one aspect, an audio noise reduction apparatus is provided, the apparatus comprising:
the extraction module is used for extracting a first noise spectrum of the audio frame to be denoised according to the proportion of low-frequency energy and high-frequency energy in the audio frame to be denoised;
a processing module, configured to process the first noise spectrum based on a reverberation transfer function to obtain a second noise spectrum, where the second noise spectrum includes a signal attenuation component in the audio frame to be denoised;
and the noise reduction module is used for removing the noise in the audio frame to be subjected to noise reduction according to the second noise spectrum to obtain a target audio frame.
In one possible implementation, the processing module is configured to:
acquiring a product of the first noise spectrum and the reverberation transfer function, and taking the product as a third noise spectrum, wherein the third noise spectrum is a signal attenuation component in the audio frame to be denoised;
acquiring the sum of the first noise spectrum and the third noise spectrum, and taking the sum of the first noise spectrum and the third noise spectrum as the second noise spectrum.
In one possible implementation, the obtaining of the reverberation transfer function includes:
acquiring environment information, the position of the receiver that receives the audio frame to be denoised, the sound source position, and the sound velocity;
and determining the reverberation transfer function according to the environment information, the receiver position for receiving the audio frame to be denoised, the sound source position and the sound velocity.
In one possible implementation, the extraction module is configured to:
determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
and extracting a first noise spectrum of the audio frame to be denoised according to the noise existence probability.
In one possible implementation, the extraction module is configured to:
determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;
and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the fact that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.
In one possible implementation manner, the determining of the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised includes:
acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;
and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised.
In one possible implementation manner, the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised further includes:
and adjusting the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the adjusting the first noise spectrum according to a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
determining the voice existence probability of the first noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the first noise spectrum;
and extracting noise in the first noise spectrum according to the voice existence probability to obtain an adjusted first noise spectrum.
In one possible implementation, the determining the speech existence probability of the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum is larger than a target threshold, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.
In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the first noise spectrum;
and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
determining whether noise exists in each of a plurality of frequency points of the first noise spectrum;
and determining the proportion of high-frequency energy and low-frequency energy in the first noise spectrum according to the noise existence result of each frequency point.
In a possible implementation manner, the determining, according to the noise existence result of each frequency point, a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
and according to the weight of each frequency point, carrying out weighted summation on the noise existence result of each frequency point to obtain the proportion of high-frequency energy and low-frequency energy in the first noise spectrum.
In one possible implementation manner, the determining whether noise exists in each of the frequency points of the first noise spectrum includes:
acquiring the magnitude relation between the amplitude of each frequency point in the multiple frequency points of the first noise spectrum and an amplitude threshold;
for any frequency point, responding to the condition that the amplitude of the frequency point is greater than or equal to an amplitude threshold value, and determining that the frequency point comprises noise;
and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than an amplitude threshold value, and determining that the frequency point does not contain noise.
In one aspect, a computer device is provided and includes one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the above-described audio noise reduction method.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the program code being loaded and executed by a processor to implement the operations performed by the above-mentioned audio noise reduction method.
In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer-readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the audio noise reduction method of any one of the above possible embodiments.
The beneficial effects of the technical solution provided by the embodiments of the disclosure include at least the following:
The technical solution takes into account how the energy of impact noise is distributed between low and high frequencies, so the extracted first noise spectrum can accurately distinguish human voice from noise. In addition, the signal attenuation component is estimated through a reverberation transfer function. Because noise estimation does not rely on audio information from multiple frames, the delay is small: impact noise can be effectively suppressed with a small number of delay frames, the signal attenuation component is effectively removed from the noise-reduced target audio frame, and the noise reduction effect is better.
Drawings
To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an audio noise reduction system provided by an embodiment of the present disclosure;
Fig. 2 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure;
Fig. 3 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure;
Fig. 4 is a speech spectrogram containing impact noise according to an embodiment of the present disclosure;
Fig. 5 is a graph of an estimated noise spectrum provided by the related art;
Fig. 6 is a graph of an estimated noise spectrum provided by embodiments of the present disclosure;
Fig. 7 is a graph of an adjusted noise spectrum provided by an embodiment of the present disclosure;
Fig. 8 is a spectrogram of a noise-reduced target audio frame provided by the related art;
Fig. 9 is a spectrogram of a noise-reduced target audio frame according to an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of an audio noise reduction device provided in an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure;
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an audio noise reduction system provided by an embodiment of the present disclosure, and referring to fig. 1, the audio noise reduction system may include a speech acquisition device 110 and a computer device 120, or may be the computer device 120 alone.
When the audio noise reduction system includes the speech capturing device 110 and the computer device 120, the speech capturing device 110 may be connected to the computer device 120 through a network or a data line. The speech capturing device 110 may have a voice capturing function and may capture the audio to be denoised. In one possible implementation, the application scenario of the present disclosure may involve relatively sharp impact noise in a communication-system environment, for example the noise of a pen being dropped or a keyboard being struck hard. It may also involve relatively sharp impact noise in other application scenarios. The computer device 120 may have an audio processing function and may perform noise reduction on the audio to be denoised acquired by the speech capturing device 110.
When the audio noise reduction system only includes the computer device 120, the computer device 120 may have a voice capture function and an audio processing function, and the computer device 120 may capture audio to be noise reduced in various environments and perform noise reduction processing on the audio to be noise reduced.
In one possible implementation manner, the computer device 120 may be a terminal or a server, which is not limited in this disclosure.
Fig. 2 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure, which is applied to a computer device, and referring to fig. 2, the method includes:
201. Extracting a first noise spectrum of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised.
202. Processing the first noise spectrum based on the reverberation transfer function to obtain a second noise spectrum, wherein the second noise spectrum comprises signal attenuation components in the audio frame to be subjected to noise reduction.
203. Removing the noise in the audio frame to be denoised according to the second noise spectrum to obtain a target audio frame.
The technical solution provided by the embodiments of the disclosure takes into account how the energy of impact noise is distributed between low and high frequencies, so the extracted first noise spectrum can accurately distinguish human voice from noise. In addition, the signal attenuation component is estimated through a reverberation transfer function. Because noise estimation does not rely on audio information from multiple frames, the delay is small: impact noise can be effectively suppressed with a small number of delay frames, the signal attenuation component is effectively removed from the noise-reduced target audio frame, and the noise reduction effect is better.
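Steps 201 to 203 can be strung together in a single-frame sketch. Several choices below are assumptions for illustration and are not taken from the source: the first noise spectrum is approximated by scaling the frame magnitude with the noise existence probability, the noise is removed by magnitude spectral subtraction with a floor (a swapped-in, commonly used technique), and all names and parameter values are hypothetical.

```python
from typing import List


def denoise_frame(
    mag: List[float],
    h: List[float],
    split_bin: int,
    ratio_threshold: float,
    p_first: float,
    p_second: float,
    floor: float = 0.0,
) -> List[float]:
    """Single-frame sketch of steps 201-203 on a magnitude spectrum."""
    if len(mag) != len(h):
        raise ValueError("mag and h must have the same length")
    if not 0 < split_bin < len(mag):
        raise ValueError("split_bin must leave at least one bin in each band")
    # Step 201: low/high energy ratio selects the noise existence probability.
    low, high = mag[:split_bin], mag[split_bin:]
    low_mean = sum(m * m for m in low) / len(low)
    high_mean = sum(m * m for m in high) / len(high)
    ratio = low_mean / (high_mean + 1e-12)
    p_noise = p_first if ratio > ratio_threshold else p_second
    # Assumed extraction rule: noise estimate = probability * magnitude.
    first = [p_noise * m for m in mag]
    # Step 202: second = first + first * h (adds the attenuation component).
    second = [n + n * g for n, g in zip(first, h)]
    # Step 203 (assumed): spectral subtraction, clamped at `floor`.
    return [max(m - n, floor) for m, n in zip(mag, second)]
```

Because every quantity is computed from the current frame alone, the sketch needs no look-ahead frames, which reflects the low-delay property described above.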
In one possible implementation, the processing the first noise spectrum based on the reverberation transfer function to obtain a second noise spectrum includes:
acquiring a product of the first noise spectrum and the reverberation transfer function, and taking the product as a third noise spectrum, wherein the third noise spectrum is a signal attenuation component in the audio frame to be denoised;
and acquiring the sum of the first noise spectrum and the third noise spectrum, and taking the sum of the first noise spectrum and the third noise spectrum as the second noise spectrum.
In one possible implementation, the obtaining process of the reverberation transfer function includes:
acquiring environment information, the position of the receiver that receives the audio frame to be denoised, the sound source position, and the sound velocity;
and determining the reverberation transfer function according to the environment information, the receiver position for receiving the audio frame to be subjected to noise reduction, the sound source position and the sound velocity.
In one possible implementation manner, the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised includes:
determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
and extracting a first noise spectrum of the audio frame to be denoised according to the noise existence probability.
In one possible implementation manner, the determining the noise existence probability of the audio frame to be denoised according to the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised comprises:
determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;
and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the condition that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.
In one possible implementation manner, the determining of the ratio of the low-frequency energy and the high-frequency energy in the audio frame to be denoised includes:
acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;
and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised.
In one possible implementation manner, the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised further includes:
and adjusting the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the adjusting the first noise spectrum according to a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
determining the speech existence probability of the first noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the first noise spectrum;
and extracting noise in the first noise spectrum according to the voice existence probability to obtain an adjusted first noise spectrum.
In one possible implementation, the determining the speech existence probability of the first noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum includes:
and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum is larger than a target threshold value, determining the existence probability of the voice corresponding to the high-frequency band as a first voice existence probability, and determining the existence probability of the voice corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.
In one possible implementation, the determining of the ratio of the high frequency energy and the low frequency energy in the first noise spectrum includes:
acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the first noise spectrum;
and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the first noise spectrum.
In one possible implementation, the determining of the ratio of the high frequency energy and the low frequency energy in the first noise spectrum includes:
determining whether noise exists in each of a plurality of frequency points of the first noise spectrum;
and determining the proportion of the high-frequency energy and the low-frequency energy in the first noise spectrum according to the noise existence result of each frequency point.
In a possible implementation manner, the determining, according to the noise existence result of each frequency point, a ratio of high-frequency energy to low-frequency energy in the first noise spectrum includes:
and according to the weight of each frequency point, carrying out weighted summation on the noise existence result of each frequency point to obtain the proportion of high-frequency energy and low-frequency energy in the first noise spectrum.
In one possible implementation, the determining whether noise exists in each of the plurality of frequency bins of the first noise spectrum includes:
acquiring the magnitude relation between the amplitude of each frequency point in the multiple frequency points of the first noise spectrum and an amplitude threshold;
for any frequency point, in response to the fact that the amplitude of the frequency point is larger than or equal to the amplitude threshold value, determining that the frequency point comprises noise;
and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than the amplitude threshold value, and determining that the frequency point does not contain noise.
Fig. 3 is a flowchart of an audio denoising method provided in an embodiment of the present disclosure, and referring to fig. 3, the method may include:
301. the computer device obtains an audio frame to be denoised.
In the embodiment of the present disclosure, the computer device may be a terminal or a server. The audio frame to be denoised may be one frame of noisy audio collected in any of various scenes. For example, in vehicles such as automobiles, ships, and airplanes, the collected audio frame to be denoised may include wind noise from driving at high speed with the windows open, or rain noise from driving on a rainy day; in a home environment, the collected audio frame to be denoised may include noise from a television or the rotation noise of a washing machine. This is not limited in the embodiment of the present disclosure.
The computer device may acquire the audio frame to be denoised in various ways. In one possible implementation, the acquisition process may include any one of the following modes one to three:
In the first mode, the computer device directly acquires the audio frame to be denoised.
The computer device may have a voice collecting function, and the computer device may directly collect sound to obtain the audio frame to be denoised.
In the second mode, the computer device acquires the audio frame to be denoised that has been collected by a voice collecting device.
The computer device may be connected to the voice collecting device through a network or a data line to obtain the to-be-denoised audio frame collected by the voice collecting device, and the voice collecting device may be any kind of device having a voice collecting function, which is not limited in the embodiments of the present disclosure.
In the third mode, the computer device extracts the audio frame to be denoised from a database.
In this mode, the audio frame to be denoised may be stored in a database, and when the computer device needs to process the audio frame to be denoised, it extracts the audio frame from the database.
It should be noted that the computer device may obtain the audio to be denoised and denoise each of its frames, obtaining a target audio frame for each frame and thereby the denoised target audio. Only the process of denoising one audio frame is described here; the other audio frames are denoised in the same way, which is not repeated here.
302. The computer device determines a ratio of low frequency energy to high frequency energy in an audio frame to be denoised.
Because the noise in the audio frame to be denoised is distributed uniformly in the frequency domain, converting the audio frame into the frequency domain for calculation reduces the computational difficulty and allows the noise to be estimated, and therefore removed, more effectively, improving the noise reduction effect. Therefore, the computer device may first obtain the frequency spectrum of the audio frame to be denoised, and then determine the ratio according to the frequency spectrum.
When the computer device obtains the above ratio, the computer device may analyze the low frequency energy and the high frequency energy in the audio frame to be denoised according to the frequency spectrum, so as to determine the ratio of the two. This process may be implemented in a number of ways, one possible implementation of which is provided below.
In this implementation manner, the computer device may obtain a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised, and determine the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.
The low-frequency energy mean value represents the low-frequency energy in the audio frame, and the high-frequency energy mean value represents the high-frequency energy in the audio frame. The computer device then takes the ratio of the two mean values as the ratio of the low-frequency energy to the high-frequency energy.
For example, the computer device performs time-frequency conversion on the audio frame to be denoised, converting its time-domain signal into a frequency spectrum. Taking a Fast Fourier Transform (FFT) of length 512, an overlap-add length of 256, and a sampling rate of 16 kHz as an example, and denoting the squared audio amplitude by $Y_a^2$, the computer device can obtain the low-frequency energy mean value $E_{low}$ and the high-frequency energy mean value $E_{high}$ from the frequency spectrum of the audio frame to be denoised through formula one and formula two below:
[Formula one: equation image BDA0002700190560000121]
[Formula two: equation image BDA0002700190560000122]
After obtaining the low-frequency energy mean value $E_{low}$ and the high-frequency energy mean value $E_{high}$, the computer device can obtain their ratio, slope, through formula three below:
$slope = E_{low} / E_{high}$ (formula three)
Of course, the process may also be implemented in other manners, for example, by obtaining the sum of the audio amplitudes of the low-frequency points and the sum of the audio amplitudes of the high-frequency points in the spectrum, and taking the ratio of the two sums as the ratio. The embodiments of the present disclosure are not limited thereto.
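As an illustration of step 302, the following Python sketch computes the two band mean values and their ratio from a power spectrum. Because formulas one and two are not reproduced in this text, the boundary between the low and high bands (`split_bin`), the small constant `eps`, and the function name `band_energy_ratio` are assumptions made purely for illustration; only the final division follows formula three.

```python
def band_energy_ratio(power_spectrum, split_bin, eps=1e-12):
    """Return (e_low, e_high, slope) for one frame.

    power_spectrum: squared magnitudes for bins 0..N/2 of one frame.
    split_bin: hypothetical band boundary; bins below it count as low frequency.
    eps: hypothetical guard against division by zero.
    """
    if not 0 < split_bin < len(power_spectrum):
        raise ValueError("split_bin must fall strictly inside the spectrum")
    low = power_spectrum[:split_bin]
    high = power_spectrum[split_bin:]
    e_low = sum(low) / len(low)      # low-frequency energy mean value
    e_high = sum(high) / len(high)   # high-frequency energy mean value
    slope = e_low / (e_high + eps)   # formula three: slope = E_low / E_high
    return e_low, e_high, slope


# A 257-bin spectrum (FFT length 512) with energy concentrated at low frequencies.
speech_like = [4.0] * 64 + [1.0] * 193
e_low, e_high, slope = band_energy_ratio(speech_like, split_bin=64)
print(e_low, e_high, slope)
```

A frame whose energy sits mostly in the low band yields a slope above 1, which the next step would treat as more likely to be voice.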
303. And the computer equipment determines the noise existence probability of the audio frame to be subjected to noise reduction according to the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction.
Through the above steps, the computer device obtains the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised, which characterizes how the energy of the frame is distributed between low and high frequencies. Impact noise generally occurs over a short time, has concentrated energy, and is distributed fairly uniformly across frequency. Accordingly, more low-frequency energy indicates more voice, while more high-frequency energy indicates more impact noise. By means of this ratio, the computer device can therefore determine the noise existence probability of the audio frame to be denoised.
In one possible implementation, a ratio threshold may be set, and the noise existence probability may be determined by comparing the ratio with the ratio threshold. The comparison has two possible outcomes, and a different noise existence probability may be set for each.
Specifically, in this implementation, in a first case, the computer device may determine the first noise existence probability as the noise existence probability of the audio frame to be denoised in response to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised being greater than the ratio threshold. In a second case, the computer device may determine a second noise existence probability as the noise existence probability of the audio frame to be denoised in response to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised being less than or equal to the ratio threshold, the first noise existence probability being less than the second noise existence probability.
The ratio threshold may be set by a relevant technician as required and may be an empirical value; its specific value is not limited in the embodiment of the present disclosure.
For example, still denoting the ratio by slope and assuming that the ratio threshold is T: if slope > T, the low-frequency energy in the signal is concentrated and the probability that the signal is human voice is high, so the noise existence probability can be set lower. If slope ≤ T, the high-frequency energy in the signal is concentrated and the probability that the signal is noise is high, so the noise existence probability can be set higher.
The noise existence probability is inversely related to the ratio of the low-frequency energy to the high-frequency energy. In a possible implementation manner, a plurality of candidate noise existence probabilities may be preset; when slope > T, the computer device may set the noise existence probability to the smallest of the candidate noise existence probabilities, and when slope ≤ T, it may set the noise existence probability to one of the other, larger candidate noise existence probabilities.
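The threshold rule of step 303 can be sketched in a few lines of Python. The candidate values `0.2` and `0.8` and the function name are hypothetical, and choosing the largest candidate in the second case is an assumption; the text only requires that the value chosen when slope > T be the smallest.

```python
def noise_presence_probability(slope, ratio_threshold, candidates=(0.2, 0.8)):
    """Map the low/high energy ratio to a noise existence probability.

    candidates: hypothetical preset candidate probabilities.
    """
    ordered = sorted(candidates)
    if slope > ratio_threshold:
        # Low-frequency energy dominates: likely voice, so use the smallest value.
        return ordered[0]
    # slope <= threshold: high-frequency energy dominates, likely impact noise.
    # Picking the largest candidate here is an illustrative assumption.
    return ordered[-1]


print(noise_presence_probability(4.0, ratio_threshold=2.0))
print(noise_presence_probability(0.5, ratio_threshold=2.0))
```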
304. And the computer equipment extracts a first noise spectrum of the audio frame to be subjected to noise reduction according to the noise existence probability.
Once the computer device has determined the noise existence probability, a first noise spectrum can be extracted from the audio frame to be denoised. The extraction of the first noise spectrum may be implemented in various ways; in one possible implementation, it may be implemented by Optimally Modified Log-Spectral Amplitude (OMLSA) estimation and Improved Minima Controlled Recursive Averaging (IMCRA) calculation.
In the OMLSA and IMCRA calculations, the noise existence probability determined in the preceding step is applied, and the first noise spectrum of the audio frame to be denoised is extracted using information from the audio frames before and after it. Specifically, the computer device may obtain the smoothed power spectrum of the audio frame to be denoised and the power spectrum of the previous frame, track the minimum of the smoothed power spectrum, derive the speech existence situation from the noise existence probability, and compensate the minimum-value estimate of the first noise spectrum, finally obtaining a more accurate first noise spectrum, which is then adjusted by the OMLSA calculation.
For example, fig. 4 shows a spectrogram of speech containing impact noise; as shown in fig. 4, the impact noise generally occurs over a short time, has concentrated energy, and is distributed fairly uniformly across frequency. With the noise reduction method provided by the related art, the estimated noise spectrum may be as shown in fig. 5: when the number of delay frames is set to a small value, much of the human voice is identified as noise, so that method is unsuitable for application scenarios with strict delay requirements and its noise reduction effect is poor. With the noise reduction method provided by the present disclosure, the obtained first noise spectrum may be as shown in fig. 6, in which the human voice components in the impact noise estimate are greatly reduced and the noise reduction effect is better.
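To make the role of the noise existence probability concrete, the following Python sketch shows a single probability-controlled recursive-averaging update of a per-bin noise power estimate. It is a deliberately simplified stand-in and not the full IMCRA procedure, which additionally tracks minima of the smoothed power spectrum and applies bias compensation. Treating the speech existence probability as one minus the frame-level noise existence probability, the smoothing constant `alpha`, and the function name are all assumptions for illustration.

```python
def update_noise_estimate(prev_noise, frame_power, noise_prob, alpha=0.85):
    """One recursive-averaging update of the per-bin noise power estimate.

    prev_noise: noise power estimate from the previous frame (one value per bin).
    frame_power: power spectrum of the current frame.
    noise_prob: noise existence probability of the current frame, in [0, 1].
    alpha: hypothetical base smoothing constant.
    """
    if not 0.0 <= noise_prob <= 1.0:
        raise ValueError("noise_prob must lie in [0, 1]")
    if len(prev_noise) != len(frame_power):
        raise ValueError("spectra must have the same number of bins")
    speech_prob = 1.0 - noise_prob  # illustrative assumption
    # The more likely speech is present, the closer the effective smoothing
    # factor is to 1, so the noise estimate is updated more slowly.
    alpha_eff = alpha + (1.0 - alpha) * speech_prob
    return [alpha_eff * n + (1.0 - alpha_eff) * y
            for n, y in zip(prev_noise, frame_power)]


# A frame judged certain to be noise pulls the estimate toward the frame power.
print(update_noise_estimate([1.0, 1.0], [5.0, 5.0], noise_prob=1.0, alpha=0.5))
```

With a noise existence probability of 0 the estimate is left essentially unchanged, which is how a low probability protects voice from being absorbed into the noise spectrum.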
305. The computer device adjusts the first noise spectrum according to a ratio of high frequency energy to low frequency energy in the first noise spectrum.
After the computer device obtains the first noise spectrum corresponding to the audio frame to be denoised, the noise spectrum may still include a small amount of human voice, for example, utterances of types [s], [z], [dz], and [ts]. The human voice contained in the first noise spectrum can be identified by analyzing the distribution of high- and low-frequency energy in the first noise spectrum, and the first noise spectrum can then be adjusted to remove that voice, thereby obtaining a more accurate first noise spectrum.
In one possible implementation, the computer device may adjust the first noise spectrum by steps one and two as follows.
Step one, determining the voice existence probability of the first noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the first noise spectrum.
In step one, the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum indicates how much of the spectrum is noise, from which it can be determined whether the first noise spectrum contains a substantial amount of human voice.
In one possible implementation, a target threshold may be set to measure whether high frequency energy concentrations are present in the first noise spectrum to determine whether [ s ], [ z ], [ dz ], [ ts ] type utterances are present. In this implementation, the computer device may determine, in response to a ratio of high-frequency energy to low-frequency energy in the first noise spectrum being greater than a target threshold, a speech presence probability corresponding to a high-frequency band as a first speech presence probability and a speech presence probability corresponding to a low-frequency band as a second speech presence probability, the first speech presence probability being less than the second speech presence probability.
The high-frequency energy of impact noise is high, so the voice existence probability corresponding to the high-frequency band can be set to a small value; to reduce voice damage, the voice existence probability corresponding to the low-frequency band is set to a large value.
In another possible implementation manner, the computer device may determine, in response to a ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum being greater than a target threshold, that a noise existence probability in the first noise spectrum is a target noise existence probability, which is a candidate noise existence probability with a smallest value.
The ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum can be obtained in multiple ways. Two possible ways are provided below; the computer device may determine the ratio in either of them, or in other ways.
In the first mode, the computer device obtains the mean value of the high-frequency energy and the mean value of the low-frequency energy in the first noise spectrum, and the ratio of the mean value of the high-frequency energy to the mean value of the low-frequency energy is used as the proportion of the high-frequency energy to the low-frequency energy in the first noise spectrum.
For example, assume that the first noise spectrum estimated in step 304 above is denoted by $\lambda_t$. The low-frequency energy mean value $Et_{low}$ and the high-frequency energy mean value $Et_{high}$ of the first noise spectrum can be obtained through formulas four and five below:
[Formula four: equation image BDA0002700190560000151]
[Formula five: equation image BDA0002700190560000152]
The computer device can obtain the ratio $slope_t$ of the high-frequency energy mean value to the low-frequency energy mean value through formula six below:
$slope_t = Et_{high} / Et_{low}$ (formula six)
If $slope_t > T_t$, the high-frequency energy of the signal is concentrated and the probability of an utterance of type [s], [z], [dz], or [ts] is high, so the noise existence probability may be set to its minimum value, $\lambda_t = \lambda_{tmin}$. Here $T_t$ is the target threshold, which may be set by a relevant technician as required and may be an empirical value; its specific value is not limited in the embodiment of the present disclosure.
In the second mode, the computer device determines whether noise exists at each of a plurality of frequency points of the first noise spectrum, and determines the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum according to the noise existence result of each frequency point.
In this mode, the ratio is determined according to whether each frequency point is noise. To determine whether noise exists at each frequency point, the computer device may compare the amplitude of each of the plurality of frequency points of the first noise spectrum with an amplitude threshold. For any frequency point, the computer device determines that the frequency point contains noise in response to its amplitude being greater than or equal to the amplitude threshold, and determines that the frequency point does not contain noise in response to its amplitude being smaller than the amplitude threshold.
For example, for the first noise spectrum $\lambda_t$ of each frame, each frequency point is judged once, which can be realized through formula seven below:
[Formula seven: equation image BDA0002700190560000153]
where $T_{t1}$ is the amplitude threshold. Whether noise exists at each frequency point is judged by whether the amplitude of that frequency point is smaller than the amplitude threshold.
After determining the noise existence result of each frequency point, the computer device may perform weighted summation on the noise existence result of each frequency point according to the weight of each frequency point to obtain the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum.
For example, define a weight array $w_i$, $i = 0, 1, \ldots, 256$, in which the weights for frequency points 0-50 should be smaller and the weights for frequency points 150-200 should be larger (for example, 0.8 for frequency points 0-50, 0.85 for 51-100, 0.95 for 101-150, 0.97 for 151-200, and 0.97 for 201-256). The noise existence results determined for the frequency points can then be weighted and summed through formula eight below to obtain the ratio D of the high-frequency energy to the low-frequency energy in the first noise spectrum.
[Formula eight: equation image BDA0002700190560000161]
If $D > T_d$, the impact noise energy level of the frame is high, and thus the speech existence probability may be set to PH1(i) = 0.7 for i = 0, …, 126 and PH1(i) = 0.1 for i = 127, …, 257, where $T_d$ is the target threshold. $T_{t1}$ and $T_d$ serve different purposes: the former is used to judge whether a given frequency point contains impact noise, and the latter is used to judge the energy level of the impact noise in the current frame. Because the high-frequency energy of impact noise is high, PH1 for the high-frequency band can be set to a small value, while PH1 for the low-frequency band is set relatively large in order to reduce voice damage.
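The second mode can be sketched in Python as follows. Since formulas seven and eight are not reproduced in this text, the sketch assumes that formula seven is a 0/1 indicator per frequency point and that formula eight is a plain, unnormalized weighted sum; both forms, together with the function names and the example thresholds, are assumptions. The weights and the PH1 values 0.7 and 0.1 are the example figures given above.

```python
def example_weights():
    """Build the 257-entry example weight array described in the text."""
    bands = [(0, 50, 0.8), (51, 100, 0.85), (101, 150, 0.95), (151, 256, 0.97)]
    weights = []
    for first, last, value in bands:
        weights.extend([value] * (last - first + 1))
    return weights


def weighted_noise_score(noise_amplitudes, amplitude_threshold, weights):
    """Return (D, flags): flags[i] is 1 where the amplitude reaches the threshold."""
    if len(noise_amplitudes) != len(weights):
        raise ValueError("one weight is required per frequency point")
    flags = [1 if a >= amplitude_threshold else 0 for a in noise_amplitudes]
    score = sum(w * f for w, f in zip(weights, flags))  # assumed form of formula eight
    return score, flags


def speech_presence_probabilities(num_bins, score, target_threshold,
                                  split_bin=127, p_low=0.7, p_high=0.1):
    """Return per-bin PH1 values when D exceeds the target threshold, else None."""
    if score <= target_threshold:
        return None  # impact-noise level not judged high; no adjustment in this sketch
    return [p_low if i < split_bin else p_high for i in range(num_bins)]


weights = example_weights()
amplitudes = [0.0] * 127 + [1.0] * 130   # noise only in the upper bins
score, flags = weighted_noise_score(amplitudes, amplitude_threshold=0.5, weights=weights)
ph1 = speech_presence_probabilities(len(amplitudes), score, target_threshold=100.0)
print(score, ph1[126], ph1[127])
```

Keeping the indicator and the weighting separate makes it easy to swap in a different per-bin decision without touching the frame-level threshold on D.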
For example, as shown in fig. 7, after the adjustment in step 305, more of the human voice components have been removed from the adjusted first noise spectrum, making it more accurate. Performing the noise removal step with such a first noise spectrum achieves a better noise reduction effect.
And step two, extracting noise in the first noise spectrum according to the existence probability of the voice to obtain an adjusted first noise spectrum.
After the computer device determines the speech existence probability, it may extract the noise again to obtain the adjusted first noise spectrum. It should be noted that this process may also be implemented by OMLSA calculation and IMCRA calculation in the same way as step 304 above, and details are not repeated here.
306. The computer device processes the first noise spectrum based on the reverberation transfer function to obtain a second noise spectrum, wherein the second noise spectrum comprises signal attenuation components in the audio frame to be denoised.
After the computer device has clearly distinguished the human voice from the noise in the audio frame to be denoised and obtained the first noise spectrum, it may process the first noise spectrum further. Because the sound emitted by a sound source does not necessarily travel directly to the sound collection device, the collected audio may also include a reverberation component, that is, a signal attenuation component. The computer device can add this reverberation component to the first noise spectrum so that it is removed as well, improving the quality of the target audio frame.
For reverberation, the directly arriving sound emitted by a sound source is a direct sound that always reaches the human ear first, because the direct sound has a shorter path than the reflected sound. In addition to the direct sound, the reflected sound forms reverberant sound, increasing the sound pressure level in the room. The reverberation phenomenon is more prominent in the room, in which the sound wave energy emitted from the sound source is gradually attenuated by being continuously absorbed by the wall surface during the propagation process. The phenomenon that sound waves are reflected back and forth in various directions and gradually attenuated is called room reverberation.
In a possible implementation manner, when the electronic device processes the first noise spectrum, the first noise spectrum may be the first noise spectrum obtained in step 304, or the first noise spectrum obtained after the adjustment in step 305.
Specifically, the electronic device may acquire a product of the first noise spectrum and the reverberation transfer function as a third noise spectrum that is a signal attenuation component in the audio frame to be denoised, and acquire a sum of the first noise spectrum and the third noise spectrum as a second noise spectrum. This second noise spectrum thus includes not only noise but also signal attenuation components.
The reverberation transfer function may differ between environments and is also related to other factors of sound propagation. One process for obtaining a reverberation transfer function is provided below; the embodiment of the present disclosure may also obtain the reverberation transfer function in other ways, and the specific way is not limited.
Specifically, the electronic device may obtain environment information, a receiver position, a sound source position, and a sound velocity of the audio frame to be noise-reduced, and determine the reverberation transfer function according to the environment information, the receiver position, the sound source position, and the sound velocity of the audio frame to be noise-reduced.
The environment information may include the size information of the room, and may also include other factors that affect the sound velocity or reflection condition, such as the type of wall, etc. Of course, the reverberation transfer function may also be determined based on more information, such as microphone direction, microphone type, sampling frequency, etc., which is not limited by the embodiments of the present disclosure.
For example, the electronic device may use Habets' RIR Generator program, a room impulse response generator that can determine a reverberation transfer function, that is, a room impulse response, based on environment information, receiver position, sound source position, and sound velocity.
The above only exemplarily provides a possible implementation of obtaining a reverberation transfer function, which may also be implemented by other ways, for example, a cepstral filtering based method, a neural network based method, or a maximum likelihood estimation based method, which is not limited by the embodiments of the present disclosure.
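The product-and-sum construction of the second noise spectrum in step 306 can be sketched as follows. The sketch assumes that the reverberation transfer function is already available as one non-negative real gain per frequency point; that representation, the example gain values, and the function name are assumptions for illustration, and obtaining the transfer function itself is outside the sketch.

```python
def second_noise_spectrum(first_noise, reverb_gain):
    """Return (second, third) noise spectra.

    first_noise: first noise spectrum, one value per frequency point.
    reverb_gain: assumed per-bin gain standing in for the reverberation
        transfer function.
    """
    if len(first_noise) != len(reverb_gain):
        raise ValueError("one gain is required per frequency point")
    # Third noise spectrum: the signal attenuation (reverberation) component.
    third = [n * h for n, h in zip(first_noise, reverb_gain)]
    # Second noise spectrum: noise plus the attenuation component.
    second = [n + t for n, t in zip(first_noise, third)]
    return second, third


second, third = second_noise_spectrum([2.0, 4.0], [0.5, 0.25])
print(second, third)
```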
307. And removing the noise in the audio frame to be denoised by the computer equipment according to the second noise spectrum to obtain the target audio frame.
Once the computer device has obtained the second noise spectrum, it can remove the noise to obtain the denoised target audio frame.
Specifically, the computer device may perform OMLSA calculation and IMCRA calculation on the audio frame to be denoised to obtain the target audio frame. This process is the same as in step 304 and is not repeated here.
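For readers who want a concrete picture of how a noise spectrum is used to clean a frame, the Python sketch below applies power spectral subtraction with a gain floor. This is a simpler, swapped-in technique and not the OMLSA estimator named above; the gain floor value and the function name are hypothetical.

```python
def suppress_noise(frame_power, noise_power, gain_floor=0.05):
    """Attenuate each frequency point by a spectral-subtraction gain.

    frame_power: power spectrum of the audio frame to be denoised.
    noise_power: estimated noise power per bin (for example, the second
        noise spectrum).
    gain_floor: hypothetical lower bound that limits over-suppression.
    """
    if len(frame_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for y, n in zip(frame_power, noise_power):
        if y > 0.0:
            gain = max(1.0 - n / y, gain_floor)
        else:
            gain = gain_floor  # nothing to subtract from an empty bin
        cleaned.append(gain * y)
    return cleaned


print(suppress_noise([4.0, 1.0, 0.0], [1.0, 2.0, 1.0]))
```

The floor keeps bins where the noise estimate exceeds the observed power from being zeroed out entirely, which is one common way to limit audible artifacts.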
For example, as shown in fig. 8 and 9, fig. 8 shows the noise spectrum obtained by multiplying the first noise spectrum by the reverberation transfer function; this is the spectrum corresponding to the attenuation segments. Fig. 9 shows the second noise spectrum obtained by adding this product to the first noise spectrum. As can be seen from fig. 9, the second noise spectrum estimated by the present disclosure includes a large amount of noise and attenuation segments, and removing these components can greatly improve the noise reduction effect.
Moreover, the noise reduction method provided by the present disclosure performs noise reduction according to the low- and high-frequency distribution characteristics of impact noise and does not need to refer to the audio information of many frames, so impact noise can be effectively suppressed with few delay frames and with less damage to the voice.
In a possible implementation manner, after the step 304, the computer device may further directly perform processing based on the first noise spectrum obtained in the step 304 to obtain a second noise spectrum, and then perform the noise removing step to obtain the target audio frame. The embodiments of the present disclosure are not limited thereto.
According to the technical scheme provided by the embodiment of the disclosure, the low-frequency and high-frequency distribution characteristics of impact noise are taken into account, so that the extracted first noise spectrum distinguishes human voice from noise accurately. In addition, the signal attenuation components are further estimated through the reverberation transfer function. Because noise estimation does not need to rely on multi-frame audio information, the delay is small and impact noise can be effectively suppressed with a small number of delay frames; the signal attenuation components are also effectively removed from the denoised target audio frame, so the noise reduction effect is better.
Fig. 10 is a schematic structural diagram of an audio noise reduction device provided in an embodiment of the present disclosure, and referring to fig. 10, the device includes:
a determining module 1001, configured to determine a noise existence probability of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
an extracting module 1002, configured to extract a noise spectrum of the audio frame to be denoised according to the noise existence probability;
and a denoising module 1003, configured to remove noise in the audio frame to be denoised according to the noise spectrum, to obtain a target audio frame.
In one possible implementation, the determining module 1001 is configured to:
determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;
and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the condition that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.
In one possible implementation, the determining module 1001 is further configured to:
acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;
and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.
In one possible implementation, the noise reduction module 1003 is configured to:
adjusting the noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the noise spectrum;
and removing the noise in the audio frame to be denoised according to the adjusted noise spectrum to obtain the target audio frame.
In one possible implementation, the noise reduction module 1003 is configured to:
determining the voice existence probability of the noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the noise spectrum;
and extracting the noise in the noise spectrum according to the voice existence probability to obtain an adjusted noise spectrum.
In one possible implementation, the noise reduction module 1003 is configured to:
and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum is larger than a target threshold value, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.
In one possible implementation, the noise reduction module 1003 is configured to:
acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the noise spectrum;
and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the noise spectrum.
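The noise-spectrum adjustment described in the preceding implementations can be sketched as below. Several points are assumptions rather than statements from the text: the band boundary `split_bin` (bins at or above it are treated as the medium-high band), the threshold and probability values, the fallback `p_default` used when the ratio does not exceed the threshold, and the rule `(1 - p_speech) * noise` used to extract the noise share of each bin.

```python
import numpy as np

def band_speech_presence(noise_mag, split_bin, target_threshold=1.0,
                         p_first=0.1, p_second=0.6, p_default=0.5):
    """Return the high/low energy ratio of a noise spectrum and per-bin speech presence."""
    energy = noise_mag ** 2
    high_mean = energy[split_bin:].mean()
    low_mean = energy[:split_bin].mean()
    ratio = high_mean / (low_mean + 1e-12)

    # p_default is a hypothetical fallback; the text only covers the case ratio > threshold.
    p_speech = np.full(noise_mag.shape, p_default)
    if ratio > target_threshold:
        p_speech[split_bin:] = p_first   # high band: smaller speech presence probability
        p_speech[:split_bin] = p_second  # low band: larger speech presence probability
    return ratio, p_speech

def adjust_noise_spectrum(noise_mag, p_speech):
    """Assumed extraction rule: keep the share of each bin attributed to noise."""
    return (1.0 - p_speech) * noise_mag

# Usage with a toy 4-bin magnitude spectrum whose high band dominates.
noise_mag = np.array([1.0, 1.0, 4.0, 4.0])
ratio, p_speech = band_speech_presence(noise_mag, split_bin=2)
adjusted = adjust_noise_spectrum(noise_mag, p_speech)
```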
In one possible implementation, the noise reduction module 1003 is configured to:
determining whether each frequency point in a plurality of frequency points of the noise spectrum has noise;
and determining the proportion of high-frequency energy and low-frequency energy in the noise spectrum according to the noise existence result of each frequency point.
In a possible implementation manner, the denoising module 1003 is configured to perform weighted summation on the noise existence result of each frequency point according to the weight of each frequency point, so as to obtain a ratio of high-frequency energy to low-frequency energy in the noise spectrum.
In one possible implementation, the noise reduction module 1003 is configured to:
acquiring the magnitude relation between the amplitude of each frequency point in a plurality of frequency points of the noise spectrum and an amplitude threshold;
for any frequency point, in response to the fact that the amplitude of the frequency point is larger than or equal to the amplitude threshold value, determining that the frequency point comprises noise;
and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than the amplitude threshold value, and determining that the frequency point does not contain noise.
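The per-frequency-point variant above (amplitude threshold, then weighted summation) can be sketched as follows. The threshold and the weights are hypothetical values chosen for illustration; the text does not state how the weights are selected, so the function only shows the mechanics of thresholding and weighting.

```python
import numpy as np

def noise_flags(noise_mag, amplitude_threshold):
    """A frequency point is marked as containing noise when its amplitude >= threshold."""
    return noise_mag >= amplitude_threshold

def weighted_noise_score(flags, weights):
    """Weighted sum of the per-point noise existence results (True counts as 1)."""
    return float(np.sum(weights * flags))

# Usage with a toy 4-point spectrum; weights are assumed to grow with frequency.
noise_mag = np.array([0.2, 0.9, 0.5, 1.5])
weights = np.array([0.1, 0.2, 0.3, 0.4])
flags = noise_flags(noise_mag, amplitude_threshold=0.5)
score = weighted_noise_score(flags, weights)
```

The resulting weighted score plays the role that the text assigns to the proportion of high-frequency to low-frequency energy in the noise spectrum.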
The apparatus provided by the embodiments of the present disclosure takes into account how impact noise is distributed between low and high frequencies, so that the extracted first noise spectrum can accurately distinguish human voice from noise. In addition, the signal attenuation component is estimated through the reverberation transfer function. Because noise estimation therefore does not require multiple frames of audio information, the delay is small: impact noise can be effectively suppressed with a small number of delay frames, the signal attenuation component is effectively removed from the denoised target audio frame, and the noise reduction effect is better.
It should be noted that the audio noise reduction apparatus provided in the above embodiment is described using the above division into functional modules only as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio noise reduction apparatus and the audio noise reduction method provided by the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure. The terminal 1100 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth. The terminal may also be an embedded smart voice terminal device installed on a central control console.
In general, terminal 1100 includes: one or more processors 1101 and one or more memories 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one program code for execution by processor 1101 to implement the audio noise reduction methods provided by method embodiments in the present disclosure.
In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. In this case, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1105, disposed on the front panel of terminal 1100; in other embodiments, there may be at least two display screens 1105, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, the display screen 1105 may be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Furthermore, the display screen 1105 may be arranged in a non-rectangular irregular shape, i.e., a shaped screen. The display screen 1105 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing or inputting the electric signals to the radio frequency circuit 1104 to achieve voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1100. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
Positioning component 1108 is used to locate the current geographic position of terminal 1100 for purposes of navigation or LBS (Location Based Service). The positioning component 1108 may be a positioning component based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
Power supply 1109 is configured to provide power to various components within terminal 1100. The power supply 1109 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1109 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also support fast-charge technology.
In some embodiments, terminal 1100 can also include one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
Acceleration sensor 1111 may detect acceleration levels in three coordinate axes of a coordinate system established with terminal 1100. For example, the acceleration sensor 1111 may be configured to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1111. The acceleration sensor 1111 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1112 may cooperate with the acceleration sensor 1111 to acquire a 3D motion of the user with respect to the terminal 1100. From the data collected by gyroscope sensor 1112, processor 1101 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1113 may be disposed on a side bezel of terminal 1100 and/or underlying display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal 1100, the holding signal of the terminal 1100 from the user can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1114 is configured to collect a fingerprint of the user, and the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1101 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. Fingerprint sensor 1114 may be disposed on the front, back, or side of terminal 1100. When a physical button or vendor logo is provided on the terminal 1100, the fingerprint sensor 1114 may be integrated with the physical button or vendor logo.
Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, processor 1101 may also dynamically adjust the shooting parameters of camera assembly 1106 based on the ambient light intensity collected by optical sensor 1115.
Proximity sensor 1116, also referred to as a distance sensor, is typically disposed on a front panel of terminal 1100. Proximity sensor 1116 is used to capture the distance between the user and the front face of terminal 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually decreases, the processor 1101 controls the display screen 1105 to switch from a screen-on state to a screen-off state; when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present disclosure. The server 1200 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where at least one program code is stored in the one or more memories 1202 and is loaded and executed by the one or more processors 1201 to implement the audio noise reduction methods provided by the foregoing method embodiments. Of course, the server 1200 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server 1200 may further include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer readable storage medium, such as a memory, including program code executable by a processor to perform the audio noise reduction method in the above embodiments is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises one or more program codes stored in a computer-readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the above-described audio noise reduction method.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is considered as illustrative of the embodiments of the disclosure and is not to be construed as limiting thereof, and any modifications, equivalents, improvements and the like made within the spirit and principle of the disclosure are intended to be included within the scope of the disclosure.

Claims (12)

1. A method for audio noise reduction, the method comprising:
extracting a first noise spectrum of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
determining the voice existence probability of the first noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the first noise spectrum;
extracting noise in the first noise spectrum according to the existence probability of the voice to obtain an adjusted first noise spectrum, wherein the adjusted first noise spectrum is the first noise spectrum without the voice;
processing the adjusted first noise spectrum based on a reverberation transfer function to obtain a second noise spectrum, wherein the second noise spectrum comprises signal attenuation components in the audio frame to be subjected to noise reduction;
and removing the noise in the audio frame to be subjected to noise reduction according to the second noise spectrum to obtain a target audio frame.
2. The method of claim 1, wherein the processing the first noise spectrum based on the reverberation transfer function to obtain a second noise spectrum comprises:
acquiring a product of the first noise spectrum and the reverberation transfer function, and taking the product as a third noise spectrum, wherein the third noise spectrum is a signal attenuation component in the audio frame to be denoised;
acquiring the sum of the first noise spectrum and the third noise spectrum, and taking the sum of the first noise spectrum and the third noise spectrum as the second noise spectrum.
3. The method of claim 1, wherein the obtaining of the reverberation transfer function comprises:
acquiring environment information, a receiver position for receiving the audio frame to be denoised, a sound source position and a sound velocity;
and determining the reverberation transfer function according to the environment information, the receiver position for receiving the audio frame to be denoised, the sound source position and the sound velocity.
4. The method of claim 1, wherein the extracting a first noise spectrum of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised comprises:
determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;
and extracting a first noise spectrum of the audio frame to be denoised according to the noise existence probability.
5. The method according to claim 4, wherein the determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised comprises:
determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;
and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the fact that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.
6. The method according to claim 1, wherein the determining of the ratio of low frequency energy and high frequency energy in the audio frame to be denoised comprises:
acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;
and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.
7. The method of claim 1, wherein determining the speech presence probability of the first noise spectrum based on the ratio of high frequency energy to low frequency energy in the first noise spectrum comprises:
and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the first noise spectrum is larger than a target threshold, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.
8. The method of claim 1, wherein determining the ratio of high frequency energy to low frequency energy in the first noise spectrum comprises:
acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the first noise spectrum;
and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the first noise spectrum.
9. The method of claim 1, wherein determining the ratio of high frequency energy to low frequency energy in the first noise spectrum comprises:
determining whether noise exists in each of a plurality of frequency points of the first noise spectrum;
and determining the proportion of high-frequency energy and low-frequency energy in the first noise spectrum according to the noise existence result of each frequency point.
10. An audio noise reduction apparatus, characterized in that the apparatus comprises a plurality of functional modules for performing the audio noise reduction method of any one of claims 1 to 9.
11. A computer device comprising one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the audio noise reduction method of any of claims 1 to 9.
12. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to perform operations performed by the audio noise reduction method of any of claims 1 to 9.
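As an illustration of the last two steps of claim 1 together with claim 2, the sketch below builds the second noise spectrum from a (possibly adjusted) first noise spectrum and a reverberation transfer function, then removes noise from a frame. It is a sketch under assumptions, not the claimed implementation: spectra are represented as per-bin magnitude arrays, the product is taken element-wise, and the removal step uses plain magnitude spectral subtraction with a floor, which is a swapped-in technique since the claims do not specify how removal is performed. All function and parameter names are hypothetical.

```python
import numpy as np

def second_noise_spectrum(first_noise, reverb_tf):
    """Claim 2 as a sketch: third = first * H (attenuation component), second = first + third."""
    third_noise = first_noise * reverb_tf  # element-wise product is an assumption
    return first_noise + third_noise

def remove_noise(frame_mag, noise_mag, floor=0.0):
    """Assumed removal rule: magnitude spectral subtraction, clipped at a floor."""
    return np.maximum(frame_mag - noise_mag, floor)

# Usage with toy 2-bin spectra.
first_noise = np.array([1.0, 2.0])
reverb_tf = np.array([0.5, 0.25])
second = second_noise_spectrum(first_noise, reverb_tf)
target = remove_noise(np.array([2.0, 2.0]), second)
```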
CN202011019647.0A 2020-09-24 2020-09-24 Audio noise reduction method, device, equipment and medium Active CN112233689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019647.0A CN112233689B (en) 2020-09-24 2020-09-24 Audio noise reduction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011019647.0A CN112233689B (en) 2020-09-24 2020-09-24 Audio noise reduction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112233689A CN112233689A (en) 2021-01-15
CN112233689B true CN112233689B (en) 2022-04-08

Family

ID=74107555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011019647.0A Active CN112233689B (en) 2020-09-24 2020-09-24 Audio noise reduction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112233689B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593604A (en) * 2021-07-22 2021-11-02 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for detecting audio quality
CN114333874A (en) * 2021-11-22 2022-04-12 腾讯科技(深圳)有限公司 Method for processing audio signal
CN114176623B (en) * 2021-12-21 2023-09-12 深圳大学 Sound noise reduction method, system, noise reduction device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005037650A (en) * 2003-07-14 2005-02-10 Asahi Kasei Corp Noise reducing apparatus
CN106448692A (en) * 2016-07-04 2017-02-22 Tcl集团股份有限公司 RETF reverberation elimination method and system optimized by use of voice existence probability
CN106488052A (en) * 2015-08-27 2017-03-08 成都鼎桥通信技术有限公司 One kind is uttered long and high-pitched sounds scene recognition method and equipment
CN109616139A (en) * 2018-12-25 2019-04-12 平安科技(深圳)有限公司 Pronunciation signal noise power spectral density estimation method and device
CN109712637A (en) * 2018-12-21 2019-05-03 珠海慧联科技有限公司 A kind of Reverberation Rejection system and method

Also Published As

Publication number Publication date
CN112233689A (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant